Top English Language Speech Datasets of 2022

English is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find datasets with a specific dialect or type of speech to train your models. 

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best English Language speech datasets.

Are you ready?

Let’s dive into our list of the best English Language speech datasets in 2022.

Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here


Here are our top picks for English Language speech datasets:

1. Biggest Non-Commercial English Language Speech Dataset

The People’s Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset.

Features:

  • Licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset).
  • A model trained on this dataset achieved a 9.98% word error rate on LibriSpeech’s test-clean test set.
  • Data was collected via searching the Internet for appropriately licensed audio data with existing transcriptions.
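
For readers unfamiliar with the metric quoted above, word error rate (WER) is the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of reference words. The snippet below is a minimal sketch of that calculation. The function name and the example sentences are made up for illustration, and it assumes whitespace tokenization with no text normalization, which published benchmarks usually handle more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only, not taken from any dataset.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Here one word is substituted and one is dropped, so the sketch reports 2 errors over 9 reference words.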

Access the dataset

Not quite your style? Check out these alternatives:

  • If you’re looking for shorter snippets of data, the Speech Commands Dataset has 65,000 one-second utterances of 30 short words, spoken by thousands of different members of the public.
  • The Common Voice Dataset is also a fantastic resource for non-commercial use: over 500 hours of speech recorded by volunteers, reading text drawn from a variety of sources, including old movies, books, and other speech media.

2. Best UK English Speech Dataset 

Datatang’s British English Speech Dataset contains 831 hours of mobile phone conversations between adults across a wide range of ages, all speaking British English.

Features:

  • 16kHz, 16bit, uncompressed wav, mono channel
  • quiet indoor environment, low background noise, without echo
  • 1,651 speakers in total: 43% male and 57% female
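
If you want to confirm that downloaded files match an audio spec like the one listed above (16kHz, 16-bit, mono, uncompressed WAV), Python's standard `wave` module can read the header. This is a rough sketch: the helper names `matches_spec` and `write_silence` are hypothetical, and the demo checks a generated silent file rather than a real file from the dataset.

```python
import os
import tempfile
import wave


def matches_spec(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return True if the WAV header matches the expected PCM layout."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes  # 2 bytes = 16-bit
            and wav.getnchannels() == channels
        )


def write_silence(path, rate, sample_width_bytes, channels, frames=1600):
    """Write a short silent WAV file, used here only as stand-in test data."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * frames * sample_width_bytes * channels)


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.wav")
    write_silence(sample, rate=16000, sample_width_bytes=2, channels=1)
    print(matches_spec(sample))  # True
```

The `wave` module only handles uncompressed PCM, so a compressed file would raise `wave.Error` instead of returning False.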

Access the dataset

Alternatives:

  • The SAVEE dataset consists of recordings from 4 male actors in 7 different emotions, for a total of 480 British English utterances.
  • The ACL Anthology Dataset has over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.

3. Best US English Speech Dataset

The Audiovisual Database of Spoken American English was developed at Butler University, Indianapolis, IN in 2007 for use by a variety of researchers to evaluate speech production and speech recognition.

Features:

  • All participants are native speakers of American English
  • Participants were between 19 and 61 years of age (with a mean age of 30 years)
  • Participants wore a Sennheiser MKE-2060 directional/cardioid lapel microphone throughout the recordings

Access the dataset

Alternatives:

  • The CALLHOME Speech Dataset features 120 unscripted 30-minute telephone conversations between native speakers of English. The calls were placed from within North America to family members and friends.
  • The Santa Barbara Corpus of Spoken American English Dataset is based on hundreds of recordings of natural speech (conversation, gossip, arguments, etc.) from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds.

4. English Pronunciation Speech Datasets

The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations. 

Features:

  • Uses a set of 39 phonemes
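
To give a feel for what a machine-readable pronunciation dictionary looks like in practice, here is a minimal parsing sketch. It assumes the plain-text layout used by the widely distributed CMUdict files: one entry per line, the word followed by its phonemes, stress digits (0, 1, 2) attached to vowels, `;;;` for comment lines, and a `(1)`-style suffix for alternative pronunciations. The function names are hypothetical and the sample lines are illustrative, so check them against the file you actually download.

```python
import re


def parse_entry(line):
    """Split one dictionary line into (word, phonemes); None for comments."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    # Alternative pronunciations carry a numeric suffix such as "READ(1)".
    word = re.sub(r"\(\d+\)$", "", word)
    return word, phones


def strip_stress(phones):
    """Drop stress digits so each symbol maps onto the base phoneme set."""
    return [p.rstrip("012") for p in phones]


# Illustrative lines in the assumed format.
sample_lines = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]

for line in sample_lines:
    entry = parse_entry(line)
    if entry is not None:
        word, phones = entry
        print(word, strip_stress(phones))
```

Stripping the stress digits is what reduces the symbols to the 39-phoneme set. Keep the digits if your model needs stress information.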

Access the dataset

Alternatives:

  • The LibriSpeech Dataset is a 1,000-hour corpus of read English speech (varied pronunciations).
  • The EmoV_DB Dataset holds a database of emotional speech, pronunciation, and verbal cues. It contains data for both male and female English actors.

5. Best English Global Accents Speech Dataset

The Speech Accent Archive contains 2,140 speech samples, each from a different talker reading the same passage. This dataset allows you to compare the demographic and linguistic backgrounds of the speakers in order to determine which variables are key predictors of each accent. The Speech Accent Archive demonstrates that accents are systematic rather than merely mistaken speech.

Features:

  • Contains native and non-native speakers of English
  • Dataset contains 2,140 speech samples
  • Participants come from 177 countries with 214 different native languages.

Access the dataset

Alternatives:

  • OpenSLR has a fantastic English Dialect Dataset. It offers over 17,500 high-quality audio recordings of native speakers from locations around England and Ireland, each speaking in their self-reported dialect.
  • The Tatoeba English Speaking Dataset is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
  • The Outlier Detection Dataset consists of 3,686 segments of English speech spoken with different accents. The majority of the data corresponds to an American accent, and only 1.65% corresponds to one of seven other accents (these are referred to as outliers).
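
The figures quoted for the Outlier Detection Dataset imply a heavy class imbalance, which is worth working through before training on it. The short calculation below is only a sketch: the segment counts are derived by rounding the quoted percentage and are not taken from the dataset's documentation.

```python
total_segments = 3686
outlier_share = 0.0165  # 1.65%, as quoted above

# Rounded estimate; the exact count is an assumption, not a published figure.
outlier_segments = round(total_segments * outlier_share)
majority_segments = total_segments - outlier_segments

print(f"Estimated outlier segments: {outlier_segments}")
print(f"Estimated American-accent segments: {majority_segments}")

# Accuracy of a trivial model that labels every segment as American.
baseline_accuracy = majority_segments / total_segments
print(f"Always-American baseline accuracy: {baseline_accuracy:.2%}")
```

Because a model that never flags an outlier already scores above 98% on plain accuracy, metrics such as precision and recall on the outlier class are more informative for this kind of data.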


Wrapping up

That concludes our top picks for the best English Language speech datasets for your projects.

We hope that this list has either helped you find a dataset for your project or shown you the many options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.
