Speech datasets are among the resources most sought after by AI/ML professionals.
Despite their popularity, speech datasets are not always easy to find in the wild. Because you need data to train your models, it’s important to get the requirements right.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best speech datasets and provided an up-to-date archive. We’ve also categorized each dataset, so you can navigate through this list with ease.
Are you ready to find a dataset to suit your project?
Let’s dive in.
Here are our top picks for Speech Datasets:
Languages:
Czech Datasets
Holds multiple dataset topics including translation, grammatical error correction, NLP speech data, manual speech annotation, movie reviews, movie descriptions, and audio tuning data.
Hungarian Datasets
Holds multiple dataset topics including sentence analysis, single speaker speech data, text analysis, and monolingual data.
Greek Datasets
Holds multiple dataset topics including tweets, profanity text analysis, computational linguistics, emotional sentiment analysis, public speaking, and single speaker data.
Malaysian Datasets
Holds multiple dataset topics including conversational speech, transcription, indoor environments, multilingual glossaries, bilingual index, translation, and sentiment value analysis.
Thai Datasets
Holds multiple dataset topics including human-annotation sentiment classification, conversational speech, text analysis, famous Thai food dishes, smart homes, vehicle data, and manual transcription.
Romanian Datasets
Holds multiple dataset topics including speech translation, transcriptions, TED talks, audio validation, culture, finance, politics, science, sports, technology, and monolingual data.
Burmese Datasets
Holds multiple dataset topics including sentence analysis, transcription, conversational speech, news articles, and textbook data.
Persian Datasets
Holds multiple dataset topics including news articles, author-extracted keyphrases, Instagram comment analysis, labeled text, and monolingual data.
Egyptian Datasets
Holds multiple dataset topics including phonological phenomena, tweet analysis, hieroglyphics, object detection, transcribed conversational speech, linguistic tags, and word relations.
Tamil Datasets
Holds multiple dataset topics including sentiment annotation, comment analysis, information-seeking tasks, native speaker annotation, and cross-lingual retrieval.
Korean Datasets
Holds multiple dataset topics including intention identification, single text utterances, conversational speech, labeled news comments, and toxic speech detection.
Bengali Datasets
Holds multiple dataset topics including transcribed audio, hate speech detection, document classification, sentiment analysis, single-hand gestures, social media/Wiki page resources, and NLP tasks.
Hindi Datasets
Holds multiple dataset topics including monolingual data, novels and short stories, sentence and special character texts, and Hindi captioning.
Mongolian Datasets
Holds multiple dataset topics including text content, annotated speaker identity, voiceprint recognition, speech recognition tasks, and monolingual data.
Bulgarian Datasets
Holds multiple dataset topics including questions from matriculation exams, online quizzes, resource grammar, and historical analysis.
Ukrainian Datasets
Holds multiple dataset topics including grammatical error correction, government surveys, treebank annotation, universal morphological analysis, and cross-linguistic analysis.
Dutch Datasets
Holds multiple dataset topics including book reviews, binary sentiment polarity label analysis, personality prediction, news articles, monolingual and multilingual NLP models, partnership detection, and word frequency.
Vietnamese Datasets
Holds multiple dataset topics including monolingual data, multiple choice questions and answers, reading comprehension, emotion recognition, image captioning, and gender/name prediction.
Turkish Datasets
Holds multiple dataset topics including natural language processing, transcriptions, everyday conversations, noise-reduction collections, and digital satellite transmissions.
Polish Datasets
Holds multiple dataset topics including annotated tweets, online reviews from medicine and hotel domains, question-answer pairs, news articles and summaries, and linguistically analyzed documents.
Russian Datasets
Holds multiple dataset topics including single-speaker speech data, public domain audiobooks, machine translations, and Wikidata.
Indonesian Datasets
Holds multiple dataset topics including automobile platforms, Wiki revision history, news websites, comments and reviews from online sources, Twitter texts, and monolingual data processing.
Japanese Datasets
Holds multiple dataset topics including linguistic phenomena, monolingual data, Japanese documentation, parallel sentence analysis, handwriting training, and machine reading comprehension.
Portuguese Datasets
Holds multiple dataset topics including instructional videos, Portuguese translation, public research, Twitter texts, and Facebook posts.
Arabic Datasets
Holds multiple dataset topics including YouTube content, handwriting analysis, dialect speech data, telephone conversations, and transcribed scripted speech.
German Datasets
Holds multiple dataset topics including image and text pairing, media transcriptions, news articles, monolingual data, novel citations, and emotion classification.
Indian Datasets
Holds multiple dataset topics including conversational speech training, dialect training, isolated word samples, and text data analysis.
French Datasets
Holds multiple dataset topics including speech style analysis, monolingual data, pronunciation transcription, telephone conversation, Wiki articles, and command and query speech.
Spanish Datasets
Holds multiple dataset topics including conversational speech, audio transcription, weather recordings, telephone conversation, and speech development analysis.
English Datasets
Holds multiple dataset topics including conversational speech, old movie speech data, telephone conversations, pronunciation transcription, speech recognition, linguistic speech analysis, and dialect speech data.
Dialects:
Mandarin Datasets
Holds multiple dataset topics including speech recognition, emotional speech analysis, YouTube and podcast speech data, sentence transcription, automatic speech scoring, and news broadcasting speech.
Miscellaneous:
Natural Language Processing Datasets
Holds multiple dataset topics including audiobook passages, speech recognition, dialect speech analysis, detailed reading of the New Testament, multilingual speech-to-text translation, language classification, caption annotation, telephone conversations, and TED Talk transcripts.
Wrapping up
We hope this list helped you find a dataset for your project and showed you the myriad of options available to you.
If there are any datasets you would like us to add to the list, let us know here.
If you would like to find out more about building a custom dataset for your project, please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.