At Twine, we specialize in helping companies create high-quality custom audio and video datasets.
We often get asked whether there are any off-the-shelf audio and video datasets we would recommend – both for testing and for use alongside a custom approach.
So, we’ve ransacked the web to find only the top Arabic Language datasets, so you don’t have to.
Are you ready? Let’s dive into our list of the best Arabic Language datasets.
Here are our top picks for Arabic Language Datasets:
1. Biggest Arabic Language Dataset
The Massive Arabic Speech Corpus (MASC) contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. MASC is a multi-regional, multi-genre, and multi-dialect dataset intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition.
Features:
- 1,000 hours of speech sampled at 16 kHz
- Crawled from over 700 YouTube channels
- Multi-regional, multi-genre, and multi-dialect
2. Best Handwritten Arabic Language Dataset
The dataset is composed of 16,800 characters written by 60 participants aged between 19 and 40, 90% of whom are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms.
Features:
- Contains 16,800 handwritten Arabic characters.
- The database is partitioned into two sets: a training set (13,440 characters, 480 images per class) and a test set (3,360 characters, 120 images per class).
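The figures quoted above are consistent with each other if the dataset has 28 classes, one per letter from ’alef’ to ’yeh’. That class count is our assumption, as it is not stated in the description. A quick sanity check, as a rough sketch:

```python
# Figures quoted in the dataset description above.
PARTICIPANTS = 60
REPETITIONS_PER_CHARACTER = 10
TRAIN_IMAGES_PER_CLASS = 480
TEST_IMAGES_PER_CLASS = 120

# Assumption: 28 classes, one per letter of the Arabic alphabet.
NUM_CLASSES = 28

# Every participant writes every character ten times.
total_characters = PARTICIPANTS * NUM_CLASSES * REPETITIONS_PER_CHARACTER

# Per-class image counts scaled up to the whole split.
train_total = TRAIN_IMAGES_PER_CLASS * NUM_CLASSES
test_total = TEST_IMAGES_PER_CLASS * NUM_CLASSES

# Share of the data that ends up in the training set.
train_fraction = train_total / (train_total + test_total)

print(total_characters, train_total, test_total, train_fraction)
```

Under that assumption the totals match the quoted 16,800, 13,440, and 3,360, which works out to an 80/20 train/test split.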
3. Best Diverse Arabic Language Dataset
The Arabic Dialect Identification for 17 countries (ADI17) Dataset contains around 3,000 hours of Arabic dialect speech data from 17 countries in the Arab world, collected from YouTube. Because the speech data was gathered from YouTube channels, the creators admit that the dataset might contain some labeling errors.
Features:
- The ADI17 dataset is available to download for research purposes under a Creative Commons Attribution-ShareAlike 4.0 International License.
4. V7 Arabic Handwritten Characters Dataset
The dataset is composed of 16,800 characters written by 60 participants aged between 19 and 40, 90% of whom are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms. The forms were scanned at a resolution of 300 dpi.
Alternatives:
- The CALLFRIEND Egyptian Arabic dataset consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
5. Best Scripted Arabic Language Dataset
The Arabic Scripted Speech Corpus dataset consists of 325 hours of transcribed Arabic scripted speech focusing on daily-use sentences, news, command and query, and keyword spotting.
Features:
- Contributions by 489 speakers
- Recorded on mobile devices in quiet, indoor environments
- WAV (PCM) 16 kHz, 16 bits, mono
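If you need to confirm that downloaded files match the quoted format (16 kHz, 16-bit, mono WAV), Python's standard `wave` module is enough. The sketch below is illustrative only: the function names and the `EXPECTED_FORMAT` dictionary are our own, not part of the corpus or its tooling.

```python
import wave

# Format quoted in the feature list above: 16 kHz, 16-bit (2 bytes), mono.
EXPECTED_FORMAT = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 16000}


def read_wav_format(path):
    """Return the channel count, sample width, and frame rate of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }


def matches_expected_format(path):
    """True if the file is 16 kHz, 16-bit, mono PCM."""
    return read_wav_format(path) == EXPECTED_FORMAT


def uncompressed_size_bytes(hours, frame_rate_hz=16000, sample_width_bytes=2, channels=1):
    """Raw PCM size for a given duration, ignoring WAV header overhead."""
    return int(hours * 3600 * frame_rate_hz * sample_width_bytes * channels)
```

For example, `uncompressed_size_bytes(325)` gives about 37.4 GB of raw audio for the full 325 hours, which is useful when planning storage.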
Wrapping up
To conclude, here are our top picks for the best Arabic Language datasets:
- Biggest Arabic Language Dataset – The Massive Arabic Speech Corpus (MASC)
- Best Handwritten Arabic Language Dataset – The Arabic Handwritten Characters Dataset
- Best Diverse Arabic Language Dataset – The Arabic Dialect Identification for 17 countries (ADI17) Dataset
- Best Scripted Arabic Language Dataset – The Arabic Scripted Speech Corpus
We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.