Top Spanish Language Speech Datasets of 2022

Spanish is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find datasets with a specific dialect or type of speech to train your models. 

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Spanish Language speech datasets.

Are you ready?

Let’s dive into our list of the best Spanish Language speech datasets in 2022.

Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here


Here are our top picks for Spanish Language speech datasets:

1. Biggest Non-Commercial Spanish Language Speech Dataset

This open-source dataset consists of 5.56 hours of transcribed Peninsular Spanish conversational speech: 17 themed conversations between four pairs of speakers.

Features:

  • WAV format: 16 kHz, 16-bit, mono
  • Recorded in a quiet indoor environment
  • Spontaneous, themed conversations
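
If you want to confirm that downloaded files really match the stated format (16 kHz, 16-bit, mono WAV), a quick check with Python's standard-library wave module is usually enough. The sketch below is only an illustration: the function name check_wav_format and the example file path are our own assumptions, not part of any dataset's tooling.

```python
import wave


def check_wav_format(path, sample_rate=16000, sample_width_bytes=2, channels=1):
    """Return a list of mismatches between a WAV file and the expected format.

    Defaults correspond to 16 kHz, 16-bit (2 bytes per sample), mono.
    An empty list means the file matches.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {sample_rate} Hz"
            )
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth()} bytes, "
                f"expected {sample_width_bytes} bytes"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channel(s), expected {channels}"
            )
    return problems


# Example usage (hypothetical file name):
# print(check_wav_format("conversation_01.wav"))
```

Running a check like this over a whole download before training can catch files that were re-encoded or resampled along the way.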

Access the dataset

Not quite your style? Check out these alternatives:

  • The LibriVox Spanish Speech Dataset features 73 hours of read speech and transcripts. The audio consists of sentences from 300 books read by 154 native Spanish speakers (77 men and 77 women).
  • The TEDx Spanish Corpus Dataset contains 24 hours of spontaneous speech from speakers at TEDx events. The corpus is gender-unbalanced: most of the speakers are men.

2. Best Argentinian Spanish Speech Dataset 

Created by Google in 2018, the Argentinian Spanish [es-ar] multi-speaker speech dataset contains about 5,900 transcribed, high-quality audio recordings of Argentinian Spanish sentences recorded by volunteers, supplied as WAV files.

Features:

  • 16 kHz, 16-bit, uncompressed WAV, mono channel
  • Quiet indoor environment, low background noise, no echo
  • 1,651 speakers in total: 43% male and 57% female

Access the dataset

3. Best Colombian Spanish Speech Dataset

This dataset was collected for speech technology research from native Colombian Spanish speakers who volunteered to supply the data. 

Features:

  • Over 4,900 sentences
  • High-quality audio (48 kHz, 16-bit, mono, WAV)
  • Recorded in a quiet environment
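
Sample rate has a direct effect on how much disk space a dataset needs. As a rough back-of-the-envelope sketch (assuming uncompressed PCM and ignoring the small WAV header; the helper names are ours, for illustration only), the snippet below compares 48 kHz and 16 kHz audio and estimates the size of a 5.56-hour, 16 kHz, 16-bit, mono corpus.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Bytes needed to store one second of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def estimated_size_gb(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Approximate size in gigabytes (1 GB = 10**9 bytes), excluding file headers."""
    seconds = hours * 3600
    return seconds * pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels) / 1e9


print(pcm_bytes_per_second(16000, 16))  # 32000
print(pcm_bytes_per_second(48000, 16))  # 96000

# 48 kHz audio takes three times the space of 16 kHz audio of the same length
print(pcm_bytes_per_second(48000, 16) / pcm_bytes_per_second(16000, 16))  # 3.0

# 5.56 hours of 16 kHz, 16-bit, mono speech
print(round(estimated_size_gb(5.56, 16000, 16), 2))  # 0.64
```

If you plan to mix 48 kHz recordings with 16 kHz ones, you will usually need to resample to a common rate first; that step is not shown here.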

Access the dataset

4. Best Mexican Spanish Speech Dataset

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts. 

Features:

  • Recorded as 16 kHz, 16-bit PCM in FLAC format, with transcripts provided as UTF-8 encoded plain text.
  • Gender-balanced participants.

Access the dataset

Alternatives:

  • The Magic Data Speech Dataset contains 300 hours of Mexican Spanish speech: daily-use sentences from both male and female speakers.
  • The Summa Linguae Mexican Speech Dataset contains recordings (69 utterances in total) of voice commands without a wake word in Mexican Spanish (es_MX) from 106 participants aged 16-65.

5. Best Basque Speech Dataset

This dataset was collected for speech technology research from native Basque speakers who volunteered to supply the data. The audio is high quality (48 kHz, 16-bit, mono, WAV) and was recorded in a quiet environment.

Some quality checks have been done on the data, but there might still be mistranscriptions or artifacts in the audio.

Features:

  • Contains native and non-native speakers of Basque

Access the dataset

6. Best Catalan Speech Dataset

This dataset contains transcribed high-quality audio of Catalan sentences recorded by volunteers, separated into male and female audio files.

Features:

  • Contains native and non-native speakers of Catalan

Access the dataset

Alternatives:

  • The Catalan Speech Database contains recordings of 550 adult Catalan speakers who uttered over 290 items (read and spontaneous). The data were recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place). 
  • The Castilian Spanish SpeechDat(II) FDB-1000 Dataset contains recordings of 1,000 Castilian Spanish speakers (481 males, 519 females) made over the Spanish fixed telephone network.

Wrapping up

To conclude, here are our top picks for the best Spanish Language speech datasets for your projects:

  1. Biggest Non-Commercial Spanish Language Speech Dataset: Peninsular Spanish conversational speech
  2. Best Argentinian Spanish Speech Dataset: Argentinian Spanish [es-ar] multi-speaker speech dataset
  3. Best Colombian Spanish Speech Dataset: Colombian Spanish [es-co] multi-speaker speech dataset
  4. Best Mexican Spanish Speech Dataset: CIEMPIESS Dataset
  5. Best Basque Speech Dataset: Basque Speech Dataset
  6. Best Catalan Speech Dataset: Catalan [ca-es] multi-speaker speech dataset

We hope that this list has either helped you find a dataset for your project or shown you the range of options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.