Spanish is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find datasets with a specific dialect or type of speech to train your models.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Spanish Language speech datasets.
Are you ready?
Let’s dive into our list of the best Spanish Language speech datasets in 2022.
Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.
Here are our top picks for Spanish Language speech datasets:
1. Biggest Non-Commercial Spanish Language Speech Dataset
This open-source dataset consists of 5.56 hours of transcribed Peninsular Spanish conversational speech on set topics, made up of 17 conversations between four pairs of speakers.
Features:
- WAV format: 16 kHz, 16-bit, mono
- Recorded in quiet indoor environment
- Spontaneous, themed conversations
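If you download a dataset like this, it can be worth confirming that the files really match the advertised format before training. The sketch below uses only Python's standard `wave` module. The function names `describe_wav` and `matches_spec` are our own illustrative helpers, not part of any dataset's tooling, and the demo file is one second of silence generated in memory rather than real dataset audio.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bit_depth, channels, duration_s) for a PCM WAV file.

    `source` may be a file path or a seekable binary file object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_spec(source, rate=16000, bits=16, channels=1):
    """Check a file against an expected format (defaults: 16 kHz, 16-bit, mono)."""
    got_rate, got_bits, got_channels, _ = describe_wav(source)
    return (got_rate, got_bits, got_channels) == (rate, bits, channels)


# Demo on a synthetic file (assumption: stands in for a real recording).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)       # mono
    out.setsampwidth(2)       # 2 bytes = 16 bits
    out.setframerate(16000)   # 16 kHz
    out.writeframes(b"\x00\x00" * 16000)  # one second of silence

buffer.seek(0)
print(describe_wav(buffer))
```

To check a real download, pass the file path instead of the in-memory buffer, for example `matches_spec("clip.wav")`, where `clip.wav` is a placeholder name.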
Not quite your style? Check out these alternatives:
- The LibriVox Spanish Speech Dataset features 73 hours of read speech and transcripts. The audio consists of sentences from 300 books read by 154 native Spanish speakers (77 men and 77 women).
- The TEDx Spanish Corpus Dataset contains spontaneous speech from several speakers at TEDx events, most of them men. The corpus is gender-unbalanced and runs to 24 hours.
2. Best Argentinian Spanish Speech Dataset
Created by Google in 2018, the Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset contains about 5,900 transcribed, high-quality recordings of Argentinian Spanish sentences read by volunteers, supplied in WAV format.
Features:
- 16 kHz, 16-bit, uncompressed WAV, mono channel
- Quiet indoor environment with low background noise and no echo
- 1,651 speakers in total (43% male, 57% female)
Alternatives:
- The Crowdsourced Argentinian Spanish Speech Dataset is a fantastic option. It contains recordings of simple weather messages in Argentinian Spanish (90 messages) and Peninsular Spanish (90 messages), recorded by volunteers in Buenos Aires, Argentina.
- The Emilia Argentinian Spanish Speech Dataset runs to 3 hours 15 minutes and contains over 2,218 sentences.
3. Best Colombian Spanish Speech Dataset
This dataset was collected for speech technology research from native Colombian Spanish speakers who volunteered to supply the data.
Features:
- Over 4,900 sentences
- High-quality audio (48 kHz, 16-bit, mono, WAV)
- Recorded in a quiet environment
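Sample rate has a direct effect on how much disk space a dataset needs, which matters when you compare 48 kHz recordings like these with the 16 kHz datasets above. As a rough illustration, uncompressed PCM size is duration × sample rate × bytes per sample × channels. The helper name `pcm_bytes` is hypothetical, and the one-hour duration is an arbitrary example, not a figure for this dataset.

```python
def pcm_bytes(duration_s, sample_rate_hz, bit_depth=16, channels=1):
    """Estimate the raw PCM payload size in bytes (WAV header excluded)."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


one_hour = 3600  # seconds; an arbitrary example duration

size_48k = pcm_bytes(one_hour, 48000)
size_16k = pcm_bytes(one_hour, 16000)

print(f"1 h at 48 kHz, 16-bit mono: {size_48k / 1e6:.1f} MB")
print(f"1 h at 16 kHz, 16-bit mono: {size_16k / 1e6:.1f} MB")
print(f"Ratio: {size_48k / size_16k:.0f}x")
```

In other words, an hour of 48 kHz, 16-bit mono audio takes three times the space of the same hour at 16 kHz, so budget storage accordingly if you plan to keep the original files alongside downsampled copies.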
Alternatives:
- The OpenSLR Colombian Spanish Speech Dataset consists of high-quality audio recorded by male and female volunteers.
- Colombian Spanish Speech Database contains the recordings of 1,065 speakers (563 males and 502 females) recorded over the fixed telephone network using an E-1 interface.
4. Best Mexican Spanish Speech Dataset
CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM). It consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts.
Features:
- Recorded in 16 kHz, 16-bit PCM FLAC format, with transcripts presented as UTF-8 encoded plain text.
- Gender-balanced participants.
Alternatives:
- The Magic Data Speech Dataset contains 300 hours of Mexican Spanish speech: daily-use sentences from both male and female speakers.
- The Summa Linguae Mexican Speech Dataset contains recordings of voice commands without a wake word in Mexican Spanish (es_MX): 69 utterances in total, from 106 participants aged 16-65.
5. Best Basque Speech Dataset
This dataset was collected for speech technology research from native Basque speakers who volunteered to supply the data. The audio is high quality (48 kHz, 16-bit, mono, WAV) and was recorded in a quiet environment.
Some quality checks have been done on the data, but there might still be mistranscriptions or artifacts in the audio.
Features:
- Contains native and non-native speakers of Basque
Alternatives:
- The ACL Anthology Open-Source Basque Speech Dataset has 33 hours of crowd-sourced recordings from 132 male and female native speakers. The recording scripts also include material for eliciting global and local place names, personal names, and business names.
- The EuskoParl Basque Spanish Speech Dataset has over 180 hours of recorded speech, with 81 male and female native speakers.
6. Best Catalan Speech Dataset
This data set contains transcribed high-quality audio of Catalan sentences recorded by volunteers, separated into male and female audio files.
Features:
- Contains native and non-native speakers of Catalan
Alternatives:
- The Catalan Speech Database contains recordings of 550 adult Catalan speakers who uttered over 290 items (read and spontaneous). The data were recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
- The Castillian Spanish SpeechDat(II) FDB-1000 Dataset contains the recordings of 1,000 Castillian Spanish speakers (481 males, 519 females) recorded over the Spanish fixed telephone network.
Wrapping up
To conclude, here are our top picks for the best Spanish Language speech datasets for your projects:
- Biggest Non-Commercial Spanish Language Speech Dataset: Peninsular Spanish conversational speech
- Best Argentinian Spanish Speech Dataset: Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset
- Best Colombian Spanish Speech Dataset: Colombian Spanish [es-co] multi-speaker speech dataset
- Best Mexican Spanish Speech Dataset: CIEMPIESS Dataset
- Best Basque Speech Dataset: Basque Speech Dataset
- Best Catalan Speech Dataset: Catalan [ca-es] multi-speaker speech dataset
We hope that this list has either helped you find a dataset for your project or shown you the many options available to you.
If there are any datasets you would like us to add to the list, please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.