India’s constitution recognizes 22 scheduled languages, Hindi among them, so it can be difficult to find the exact speech dataset you need. That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Indian Language speech datasets.
Are you ready?
Let’s dive into our list of the best Indian Language speech datasets in 2022.
Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.
Here are our top picks for the best Indian Language Datasets out there:
1. The Biggest Indian Language Dataset
Microsoft Speech Corpus (Indian languages) is currently the biggest Indian language dataset and contains conversational and phrasal speech training and test data for Gujarati, Telugu, and Tamil languages.
Features:
- Audio in WAV format
- Non-commercial dataset
Alternatives:
- The Indian Language Recognition Dataset is a massive 20GB dataset of audio samples in 10 different Indian languages. Each audio sample is 5 seconds long. This dataset was created using regional videos available on YouTube.
2. Best Hindi Dataset
The Hindi Raw Speech Corpus was collected from speakers of all genders and different age groups in the Awadhi, Bhojpuri, Magahi, and Khariboli belts. The LDC-IL Hindi speech data runs to 121:00:06 hours. The data set consists of different types of datasets that are made up of word lists, sentences, running texts, and date formats.
Features:
- Total Speakers: 488 (234 Female and 254 Male)
- 70,686 Audio Segments | 48 kHz | 16 bit wav
- Data package includes audio and corresponding transcripts.
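Before training on a corpus like this, it can be worth confirming that the files really match the advertised format. The following is a minimal sketch using Python's standard `wave` module; the helper names (`describe_wav`, `matches_spec`, `make_silent_wav`) are our own, and the silent clip generated in memory is only a stand-in for a real file from the data package.

```python
import io
import wave


def describe_wav(source):
    """Read header fields from a WAV file path or binary file-like object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def matches_spec(info, sample_rate_hz=48_000, bit_depth=16):
    """Check the header against the format quoted in the feature list."""
    return info["sample_rate_hz"] == sample_rate_hz and info["bit_depth"] == bit_depth


def make_silent_wav(seconds=1, sample_rate_hz=48_000):
    """Build a mono 16-bit silent clip in memory (stand-in for a corpus file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * sample_rate_hz * seconds)
    buf.seek(0)
    return buf


info = describe_wav(make_silent_wav())
print(info, matches_spec(info))
```

With a real download, you would pass each file path to `describe_wav` instead of the in-memory clip and flag any file where `matches_spec` returns `False`.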
Alternatives:
- OpenSLR Hindi Speech Dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio respectively. The train set contains utterances from a set of 59 speakers, and the test set contains utterances from a disjoint set of 19 speakers. The audio files are sampled at 8kHz with 16-bit encoding.
- The Hindi Speech Recognition database was collected in Uttar Pradesh and Bihar and contains the voices of 650 different native speakers, who were selected according to age distribution (16-20, 21-50, 51+), gender, dialectal region, and environment (home, office, and public place).
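A speaker-disjoint split, like the one described for the OpenSLR Hindi set, means no voice appears in both train and test. Here is a small sketch of how such a split can be built and checked; the speaker and utterance IDs are made up for illustration, and only the hour figures come from the description above.

```python
def split_by_speaker(utterances, test_speakers):
    """Route each (speaker_id, utterance_id) pair to train or test by speaker."""
    test_speakers = set(test_speakers)
    train, test = [], []
    for speaker_id, utterance_id in utterances:
        target = test if speaker_id in test_speakers else train
        target.append((speaker_id, utterance_id))
    return train, test


def speakers_in(split):
    """Collect the distinct speaker IDs present in a split."""
    return {speaker_id for speaker_id, _ in split}


# Hypothetical IDs, for illustration only.
utterances = [("spk01", "u1"), ("spk01", "u2"), ("spk02", "u3"), ("spk03", "u4")]
train, test = split_by_speaker(utterances, {"spk03"})

# The property that matters: no speaker is shared between the two sets.
assert speakers_in(train).isdisjoint(speakers_in(test))

# Share of audio held out for testing, using the hours quoted above.
train_hours, test_hours = 95.05, 5.55
test_share = test_hours / (train_hours + test_hours)
print(f"test share of audio: {test_share:.1%}")
```

Splitting by speaker rather than by utterance keeps the test score from being inflated by voices the model has already heard.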
3. Best Gujarati language Dataset
The Gujarati Raw Speech Corpus consists of recordings of four dialects, namely South Gujarat, Central Gujarat, North Gujarat, and Saurashtra. LDC-IL has 57:17:08 hours of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets that are made up of word lists, sentences, texts, and date formats. Each speaker recorded items randomly selected from a master dataset.
Features:
- 96 female and 108 male speakers – approximately 15 minutes of speech per speaker
- Gujarati mother-tongue speakers of different age groups
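The quoted figures can be cross-checked with a little arithmetic. This sketch assumes the total duration is written as hours:minutes:seconds, which the listing does not state explicitly; `parse_hms` is our own helper name.

```python
def parse_hms(text):
    """Convert an 'H:MM:SS' string to a number of seconds (assumed format)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = parse_hms("57:17:08")  # Gujarati raw speech total
speaker_count = 96 + 108               # female + male speakers
minutes_per_speaker = total_seconds / speaker_count / 60

print(f"{total_seconds / 3600:.2f} hours in total")
print(f"{minutes_per_speaker:.1f} minutes per speaker on average")
```

Under that reading, the average works out to roughly 17 minutes per speaker, in the same ballpark as the quoted "approximately 15 minutes".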
Alternatives:
- The Crowdsourced high-quality Gujarati multi-speaker speech data set was collected for speech technology research from native Gujarati speakers who volunteered to supply the data. The audio is high quality (48kHz, 16 bit, mono, Wave audio), recorded in a quiet environment.
4. Best Urdu language Dataset
The Urdu Speech dataset presents speech files recorded for isolated words of Urdu, consisting of 2,500 Urdu audio samples.
Features:
- Comprises 250 isolated Urdu words recorded by ten individuals
- The sampling frequency is 16000 Hz.
- Speakers include both native and non-native, male and female individuals
- The corpus can be used for both speech and speaker recognition tasks.
Alternatives:
- The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. The recordings in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S. Army Research Laboratory (ARL) provided this corpus to the Linguistic Data Consortium for distribution.
- The Urdu Raw Speech Corpus comprises recordings from 499 participants collected from various age groups of male and female native speakers. This data includes Texts, Sentences, Date Formats, and different wordlists.
5. Best Bengali language Dataset
The LDC-IL Bengali Speech dataset consists of recordings of participants from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The data set consists of different types of datasets that are made up of word lists, sentences, running texts, and date formats.
Alternatives:
- The Bangla Real Number Audio Dataset contains audio recordings of Bangla real numbers with their corresponding text, and is designed specifically for Bangla speech recognition. There are five speakers (alamin, ashraful, midhat, nahid, nayem) in this dataset, aged 20 to 23. The vocabulary contains only Bangla real numbers (shunno-ekshoto, hazar, loksho, koti, doshomic, etc.). There are 175 audio files in total (35 from each speaker).
- The Crowdsourced high-quality Bengali [bn-in/bn-bd] multi-speaker speech dataset was collected from native Indian Bengali and Bangladesh Bengali speakers who volunteered to supply the data. The audio is high quality (48kHz, 16 bit, mono, Wave audio), recorded in a quiet environment.
6. Best Malayalam language dataset
The LDC-IL Malayalam Speech data set consists of different types of datasets that are made up of word lists, sentences, running texts, and date formats. Approximately 15 minutes of speech per speaker was recorded from 231 female and 227 male native speakers of different age groups. Each speaker recorded items randomly selected from a master dataset.
Alternatives:
- The Crowdsourced high-quality Malayalam multi-speaker speech data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers. The data set consists of wave files and a TSV transcript file.
- The Indic TTS Malayalam Speech Corpus contains audio files in wav format sampled at 48 kHz with 16 bit PCM encoding. Each audio file is an utterance of a Malayalam sentence spoken by a native Malayalam speaker (one male and one female voice).
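Datasets that ship wave files alongside a TSV transcript file are usually straightforward to load. The sketch below assumes a two-column layout (file ID, then transcript) and a `<file ID>.wav` naming scheme; both are assumptions to check against the actual download, and the sample rows and the `audio` folder name are placeholders.

```python
import csv
import io
from pathlib import Path

# Placeholder rows; the real transcript file would hold Malayalam sentences.
SAMPLE_TSV = "utt_0001\tfirst example sentence\nutt_0002\tsecond example sentence\n"


def load_transcripts(tsv_file, audio_dir):
    """Pair each transcript line with the path of its (assumed) wave file."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        file_id, transcript = row[0], row[1]
        pairs.append((Path(audio_dir) / f"{file_id}.wav", transcript))
    return pairs


pairs = load_transcripts(io.StringIO(SAMPLE_TSV), "audio")
for wav_path, transcript in pairs:
    print(wav_path.name, "->", transcript)
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")` so that Malayalam text is decoded correctly.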
Wrapping up
To conclude, here are our top picks for the best Indian Language Speech datasets:
- Best Hindi Dataset – The Hindi Raw Speech Corpus
- The Biggest Indian Language Dataset – Microsoft Speech Corpus (Indian languages)
- Best Gujarati language Dataset – The Gujarati Raw Speech Corpus
We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.