Top Indian Language Datasets of 2022

India recognizes 22 scheduled languages, Hindi among them, so finding the exact dataset you need can be difficult. That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Indian language speech datasets.

Are you ready?

Let’s dive into our list of the best Indian Language speech datasets in 2022.

Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.

Here are our top picks for the best Indian Language Datasets out there:

1. The Biggest Indian Language Dataset

Microsoft Speech Corpus (Indian languages) is currently the biggest Indian language dataset. It contains conversational and phrasal speech training and test data for Gujarati, Telugu, and Tamil.

Features:

Audio in WAV format
Non-commercial dataset

Access the dataset

Alternatives:

The Indian Language Recognition Dataset is a massive 20 GB dataset of audio samples in 10 different Indian languages. Each audio sample is 5 seconds long. The dataset was created using regional videos available on YouTube.

2. Best Hindi Dataset

The Hindi Raw Speech Corpus was collected in the Awadhi, Bhojpuri, Magahi, and Khariboli belts from speakers of all genders and different age groups. The LDC-IL Hindi speech data runs to 121:00:06 (hours:minutes:seconds). The data set is made up of word lists, sentences, running texts, and date formats.

Features:

Total Speakers: 488 (234 Female and 254 Male)
70,686 Audio Segments | 48 kHz | 16 bit wav
Data package includes audio and corresponding transcripts.
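
If you want to confirm that a downloaded file really matches the 48 kHz, 16 bit WAV format listed above, a minimal sketch using Python's standard `wave` module could look like the following. The function names `describe_wav` and `matches_spec` are our own illustrations, not part of any dataset tooling, and the sketch assumes the files are plain PCM WAV.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_spec(path, expected_rate=48000, expected_bits=16):
    """Check a file against the sample rate and bit depth a dataset advertises."""
    rate, bits, _, _ = describe_wav(path)
    return rate == expected_rate and bits == expected_bits
```

Calling `matches_spec` on a clip from the corpus (the path is whatever you saved the file as) returns `True` when the header agrees with the listed format. Pass `expected_rate=8000` for 8 kHz corpora.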

Access the dataset

Alternatives:

OpenSLR Hindi Speech Dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio respectively. The train set contains utterances from 59 speakers, and the test set contains utterances from a disjoint set of 19 speakers. The audio files are sampled at 8 kHz with 16-bit encoding.
The Hindi Speech Recognition database was collected in Uttar Pradesh and Bihar and contains the voices of 650 different native speakers, who were selected according to age distribution (16-20, 21-50, 51+), gender, dialectal region, and environment (home, office, and public place).

3. Best Gujarati language Dataset

The Gujarati Raw Speech Corpus consists of recordings of four dialects, namely South Gujarat, Central Gujarat, North Gujarat, and Saurashtra. LDC-IL has 57:17:08 (hours:minutes:seconds) of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech data set is made up of word lists, sentences, texts, and date formats. Each speaker recorded items that were randomly selected from a master dataset.

Features:

96 female and 108 male speakers – approximately 15 minutes of speech per speaker
Gujarati mother-tongue speakers of different age groups
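
The figures quoted for this corpus can be cross-checked with a little arithmetic: divide the total duration (57:17:08) by the number of speakers (96 + 108). The sketch below does that in Python. The helper names are hypothetical, and it assumes the duration string is in hours:minutes:seconds.

```python
def duration_to_minutes(hms):
    """Convert an 'HH:MM:SS' duration string to minutes."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60


def minutes_per_speaker(total_hms, speaker_count):
    """Average recording time per speaker, in minutes."""
    return duration_to_minutes(total_hms) / speaker_count


# Figures taken from the Gujarati Raw Speech Corpus description above.
gujarati_average = minutes_per_speaker("57:17:08", 96 + 108)
print(f"{gujarati_average:.1f} minutes per speaker")
```

The result comes out a little under 17 minutes per speaker, in the same ballpark as the "approximately 15 minutes" quoted in the features.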

Access the dataset

Alternatives:

The Crowdsourced high-quality Gujarati multi-speaker speech data set was collected for speech technology research from native Gujarati speakers who volunteered to supply the data. The audio is high quality (48kHz, 16 bit, mono, Wave audio), recorded in a quiet environment.

4. Best Urdu language Dataset

The Urdu Speech dataset presents speech files recorded for isolated words of Urdu, consisting of 2,500 Urdu audio samples.

Features:

Comprises 250 isolated words of Urdu recorded by ten individuals
The sampling frequency is 16000 Hz.
Speakers include both native and non-native, male and female individuals
The corpus can be used for both speech and speaker recognition tasks.

Access the dataset

Alternatives:

The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. The recordings in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S. Army Research Laboratory (ARL) provided this corpus to the Linguistic Data Consortium for distribution.
The Urdu Raw Speech Corpus comprises recordings from 499 participants, male and female native speakers from various age groups. The data includes texts, sentences, date formats, and different word lists.

5. Best Bengali language Dataset

The LDC-IL Bengali Speech dataset consists of recordings of participants from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The data set is made up of word lists, sentences, running texts, and date formats.

Access the dataset

Alternatives:

The Bangla Real Number Audio Dataset contains audio recordings of Bangla real numbers with their corresponding text and is designed specifically for Bangla speech recognition. There are five speakers (alamin, ashraful, midhat, nahid, nayem), aged 20 to 23. The vocabulary contains only Bangla real numbers (shunno-ekshoto, hazar, loksho, koti, doshomic, etc.). There are 175 audio files in total, 35 from each speaker.
The Crowdsourced high-quality Bengali [bn-in/bn-bd] multi-speaker speech dataset was collected from native Indian Bengali and Bangladesh Bengali speakers who volunteered to supply the data. The audio is high quality (48kHz, 16 bit, mono, Wave audio), recorded in a quiet environment.

6. Best Malayalam language dataset

The LDC-IL Malayalam Speech data set is made up of word lists, sentences, running texts, and date formats. Approximately 15 minutes of speech per speaker was taken from 231 female and 227 male native speakers of different age groups. Each speaker recorded items that were randomly selected from a master dataset.

Access the dataset

Alternatives:

The Crowdsourced high-quality Malayalam multi-speaker speech data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers. The data set consists of wave files and a TSV transcript file.
The Indic TTS Malayalam Speech Corpus contains audio files in WAV format sampled at 48 kHz with 16 bit PCM encoding. Each audio file is an utterance of a Malayalam sentence spoken by a native Malayalam speaker (one male and one female voice).
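
For data sets that ship wave files with a TSV transcript file, such as the crowdsourced Malayalam set above, a short loader is often all you need to pair audio with text. The sketch below is only an illustration: it assumes each row starts with an utterance ID and ends with the transcript, which may not match the actual column layout, so check the file before relying on it. `load_transcripts` is a hypothetical helper name.

```python
import csv


def load_transcripts(tsv_path):
    """Map each utterance ID to its transcript text.

    Assumes the first column is the utterance ID and the last column
    is the transcript; adjust the indices if the real file differs.
    """
    transcripts = {}
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            transcripts[row[0].strip()] = row[-1].strip()
    return transcripts
```

Opening the file as UTF-8 keeps the Malayalam script intact, and `csv.QUOTE_NONE` stops quotation marks inside a sentence from being treated as field delimiters.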

Wrapping up

To conclude, here are our top picks for the best Indian language speech datasets:

The Biggest Indian Language Dataset – Microsoft Speech Corpus (Indian languages)
Best Hindi Dataset – The Hindi Raw Speech Corpus
Best Gujarati language Dataset – The Gujarati Raw Speech Corpus

We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

