English is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find datasets with a specific dialect or type of speech to train your models.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best English Language speech datasets.
Are you ready?
Let’s dive into our list of the best English Language speech datasets in 2022.
Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.
Here are our top picks for English Language speech datasets:
1. Biggest Non-Commercial English Language Speech Dataset
The People’s Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset.
Features:
- Licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset).
- A model trained on this dataset achieved a 9.98% word error rate on Librispeech’s test-clean test set.
- Data was collected via searching the Internet for appropriately licensed audio data with existing transcriptions.
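If word error rate is new to you: it is the word-level edit distance between a reference transcript and a model's output, divided by the number of words in the reference. Here is a minimal Python sketch of that calculation. The function name and the example sentences in the usage note are our own illustration, not the evaluation code used for The People’s Speech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # At the start of row i, prev[j] is the edit distance between the
    # first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` counts one substitution against six reference words. Published benchmarks usually apply extra text normalization (punctuation, numerals) before scoring, which this sketch leaves out.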
Not quite your style? Check out these alternatives:
- If you’re looking for shorter snippets of data, the Speech Commands Dataset has 65,000 one-second utterances of 30 short words, spoken by thousands of different members of the public.
- The Common Voice Dataset is also a fantastic resource for non-commercial use: over 500 hours of speech recordings from a variety of sources, including old movies, books, and other speech media.
2. Best UK English Speech Dataset
Datatang’s British English Speech Dataset contains 831 hours of mobile phone conversations recorded from adults of a wide range of ages speaking British English.
Features:
- 16kHz, 16bit, uncompressed wav, mono channel
- quiet indoor environment, low background noise, without echo
- 1,651 speakers in total, with 43% male and 57% female
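If your training pipeline expects exactly this audio format, it can be worth checking files before you use them. Below is a rough sketch using Python's standard-library `wave` module. The function name `matches_spec` and its default values are our own assumptions based on the spec listed above, not part of any Datatang tooling.

```python
import wave


def matches_spec(path, sample_rate=16_000, sample_width_bytes=2, channels=1):
    """Return True if the WAV file matches the given format.

    Defaults follow the spec above: 16 kHz, 16-bit (2 bytes per sample),
    mono, uncompressed PCM.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == sample_rate
            and wav.getsampwidth() == sample_width_bytes
            and wav.getnchannels() == channels
            and wav.getcomptype() == "NONE"  # "NONE" means uncompressed
        )
```

You could call it as `matches_spec("recording.wav")`, where the file name is a placeholder for one of your own files. The `wave` module only reads PCM WAV files, so other formats will raise an error rather than return `False`.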
Alternatives:
- The SAVEE dataset consists of recordings from 4 male actors in 7 different emotions, equating to 480 British English utterances in total.
- The ACL Anthology Dataset has over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
3. Best US English Speech Dataset
The Audiovisual Database of Spoken American English was developed at Butler University, Indianapolis, IN in 2007 for use by a variety of researchers to evaluate speech production and speech recognition.
Features:
- All participants are native speakers of American English
- Participants were between 19 and 61 years of age (with a mean age of 30 years)
- Participants wore a Sennheiser MKE-2060 directional/cardioid lapel microphone throughout the recordings
Alternatives:
- The CALLHOME Speech Dataset features 120 unscripted 30-minute telephone conversations between native speakers of English. The calls were placed by residents of North America to family members and friends.
- The Santa Barbara Corpus of Spoken American English Dataset is based on hundreds of recordings of natural speech (conversation, gossip, arguments, etc.) from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds.
4. English Pronunciation Speech Datasets
The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.
Features:
- Has 39 phonemes
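To give a feel for working with the dictionary, here is a small Python sketch that parses entries into phoneme lists. It assumes the classic plain-text layout: one entry per line, the word followed by its phonemes, vowels carrying a stress digit, alternate pronunciations marked like `HELLO(1)`, and comment lines starting with `;;;`. Other releases may differ, so check your copy. The function names and the sample lines are illustrative only.

```python
import re
from collections import defaultdict


def parse_cmudict(lines):
    """Map each word to a list of pronunciations (lists of phoneme symbols)."""
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", word)  # HELLO(1) -> HELLO
        entries[word].append(phones)
    return dict(entries)


def strip_stress(phones):
    """Drop the trailing stress digit from vowels: AH0 -> AH."""
    return [p.rstrip("012") for p in phones]


# Illustrative lines in the assumed format, not a full copy of the dictionary.
sample = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(1)  HH EH0 L OW1",
    "SPEECH  S P IY1 CH",
]
pronunciations = parse_cmudict(sample)
```

Stripping the stress digits, as `strip_stress` does, reduces the symbols to the base phoneme set listed above.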
Alternatives:
- The LibriSpeech Dataset features a 1,000-hour corpus of read English speech (varied pronunciations).
- The EmoV_DB Dataset holds a database of emotional speech, pronunciation, and verbal cues. It contains data for both male and female English actors.
5. Best English Global Accents Speech Dataset
Speech Accent Archive contains 2,140 speech samples, each from a different talker reading the same reading passage. This dataset allows you to compare the demographic and linguistic backgrounds of the speakers in order to determine which variables are key predictors of each accent. The speech accent archive demonstrates that accents are systematic rather than merely mistaken speech.
Features:
- Contains native and non-native speakers of English
- Dataset contains 2,140 speech samples
- Participants come from 177 countries with 214 different native languages.
Alternatives:
- OpenSLR has a fantastic English Dialect Dataset. It offers over 17,500 high-quality audio recordings of native speakers from locations around England and Ireland, each speaking in their own self-reported dialect.
- The Tatoeba English Speaking Dataset is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
- The Outlier Detection Dataset consists of 3,686 segments of English speech spoken with different accents. The majority of the data corresponds to an American accent, and only 1.65% corresponds to one of seven other accents (these are referred to as outliers).
Wrapping up
To conclude, here are our top picks for the best English Language speech datasets for your projects:
- Biggest Non-Commercial English Language Speech Dataset: The People’s Speech
- Best UK English Speech Dataset: Datatang’s British English Speech
- Best US English Speech Dataset: Audiovisual Database of Spoken American English
- English Pronunciation Speech Datasets: The Carnegie Mellon University Pronouncing Dictionary
- Best English Global Accents Speech Dataset: Speech Accent Archive
We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.