Bengali is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Bengali language datasets to train your models.
That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Bengali language datasets.
Are you ready?
Let’s dive in.
Here are our top picks for Bengali language datasets:
Bengali Text to Speech Dataset
This dataset contains high-quality, multi-speaker transcribed audio data for Bengali. There are two zip files, one for each locale, and each contains a line_index.tsv file and the wave files. The line index lists a fileID and the transcription for each recording. The data has been manually quality-checked, but there might still be errors.
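To give a feel for how the line index pairs up with the audio, here is a minimal Python sketch. It rests on a few assumptions that are not confirmed above: that the file ID is the first tab-separated column, that the transcription is the last column, and that each recording is stored as `<fileID>.wav`. The `wavs` folder name and the sample rows are made up for illustration.

```python
import csv
import io
from pathlib import Path


def read_line_index(tsv_file, wav_dir="wavs"):
    """Return (wav_path, transcription) pairs from a line_index.tsv-style file.

    Assumes the first column is the file ID and the last column is the
    transcription; check this against the real file before relying on it.
    """
    pairs = []
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        file_id, text = row[0].strip(), row[-1].strip()
        # Hypothetical naming scheme: one wave file per file ID.
        pairs.append((Path(wav_dir) / f"{file_id}.wav", text))
    return pairs


# Tiny in-memory sample standing in for the real file (contents are made up).
sample = io.StringIO("utt_0001\tআমি ভাত খাই\nutt_0002\tতুমি কেমন আছো\n")
pairs = read_line_index(sample)
for wav_path, text in pairs:
    print(wav_path.name, text)
```

With the real data, you would open line_index.tsv with `encoding="utf-8"` and `newline=""` and pass the file object in place of the in-memory sample.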
Numta Handwritten Bengali Digits
The dataset is a compilation of six datasets gathered from different sources. Each of them was checked rigorously under the same criteria, so that all digits were legible to at least one human being without any prior knowledge. The initial release of the NumtaDB dataset was used for the Bengali.AI Computer Vision Challenge, where the testing set was found to contain some illegible and ambiguous digits. These digits have been replaced by legible digits of the same label.
Bengali Hate Speech
This collection introduces three datasets, covering expressions of hate, commonly used topics, and opinions. They are intended for hate speech detection, document classification, and sentiment analysis, respectively.
KU-BdSL Sign Language Dataset
The KU-BdSL Sign Language Dataset includes three variants of data. It consists of images representing single-hand gestures for the BdSL alphabet. The images were captured with several smartphones from 33 participants (25 males and 8 females). Each variant includes 30 classes that represent the 39 consonants (‘byanjonborno’) of the Bengali alphabet. There are 1,500 images in jpg format in each variant. The images were captured on flat surfaces at different times of the day to vary the brightness and contrast.
BanglaLM Dataset
This dataset contains content curated from social media, blogs, newspapers, wiki pages, and other similar resources. It contains 19,132,010 samples, each between 3 and 512 words long. It is well suited to building unsupervised machine learning models for NLP tasks involving the Bengali language.
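If you want to apply the same 3 to 512 word bounds to your own Bengali text before mixing it with this corpus, a rough Python sketch could look like the one below. It assumes that words are counted by splitting on whitespace, which may not match how the dataset authors counted them. The function name and sample sentences are hypothetical.

```python
def within_length(text, min_words=3, max_words=512):
    """Check whether a sample falls inside the stated word-count range.

    Assumption: a "word" is any whitespace-separated token.
    """
    n_words = len(text.split())
    return min_words <= n_words <= max_words


# Made-up samples for illustration only.
samples = ["আমি ভাত খাই", "হ্যাঁ", "আজ আকাশ খুব পরিষ্কার"]
kept = [s for s in samples if within_length(s)]
print(len(kept), "of", len(samples), "samples kept")
```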
Wrapping up
To conclude, here are our top picks for the best Bengali language datasets for your projects:
- Bengali Text to Speech Dataset
- Numta Handwritten Bengali Digits
- Bengali Hate Speech
- KU-BdSL Sign Language Dataset
- BanglaLM Dataset
We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available.
Please let us know if there are any datasets you would like us to add to the list.
If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.