5 Best Arabic Language Datasets of 2022

At Twine, we specialize in helping companies create high-quality custom audio and video datasets.  

We often get asked whether there are any off-the-shelf audio and video datasets we would recommend, both for testing and for use in their own custom approaches.

We’ve scoured the web to find only the top Arabic Language datasets, so you don’t have to.

Are you ready? Let’s dive into our list of the best Arabic Language datasets.


Here are our top picks for Arabic Language Datasets:

1. Biggest Arabic Language Dataset

The Massive Arabic Speech Corpus (MASC) contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. MASC is a multi-regional, multi-genre, and multi-dialect dataset that is intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition.

Access the dataset here

2. Best Handwritten Arabic Language Dataset

The dataset is composed of 16,800 characters written by 60 participants aged between 19 and 40, 90% of whom are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms.

Features:

  • Contains 16,800 handwritten Arabic characters.
  • The database is partitioned into two sets: a training set (13,440 characters, or 480 images per class) and a test set (3,360 characters, or 120 images per class).
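The counts above are easy to sanity-check before you download anything. The short Python sketch below assumes 28 classes, one per letter of the Arabic alphabet from ’alef’ to ’yeh’; the description does not state that number explicitly, so treat it as our assumption. The constant names are ours too.

```python
# Figures quoted in the dataset description above.
PARTICIPANTS = 60
REPETITIONS_PER_CHARACTER = 10
TRAIN_IMAGES_PER_CLASS = 480
TEST_IMAGES_PER_CLASS = 120

# Assumption: one class per letter of the Arabic alphabet (28 letters).
NUM_CLASSES = 28

total_characters = PARTICIPANTS * REPETITIONS_PER_CHARACTER * NUM_CLASSES
train_characters = TRAIN_IMAGES_PER_CLASS * NUM_CLASSES
test_characters = TEST_IMAGES_PER_CLASS * NUM_CLASSES

print(f"total: {total_characters:,}")
print(f"train: {train_characters:,}")
print(f"test:  {test_characters:,}")
print(f"test share: {test_characters / total_characters:.0%}")
```

Under that assumption the totals line up with the quoted figures, and the split works out to 80% training and 20% test.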

Access the dataset here

3. Best Diverse Arabic Language Dataset

The Arabic Dialect Identification for 17 countries (ADI17) Dataset contains around 3,000 hours of Arabic dialect speech data from 17 countries in the Arab world, collected from YouTube. Because the speech data was gathered from YouTube channels, the creators acknowledge that the dataset might contain some labeling errors.

Features:

  • The ADI17 dataset is available to download for research purposes under a Creative Commons Attribution-ShareAlike 4.0 International License.

Access the dataset here

4. V7 Arabic Handwritten Characters Dataset

The dataset is composed of 16,800 characters written by 60 participants aged between 19 and 40, 90% of whom are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms. The forms were scanned at a resolution of 300 dpi.

Access the dataset here

Alternatives:

  • The CALLFRIEND Egyptian Arabic dataset consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).

5. Best Scripted Arabic Language Dataset

The Arabic Scripted Speech Corpus dataset consists of 325 hours of transcribed Arabic scripted speech focusing on daily-use sentences, news, command and query, and keyword spotting.

Features:

  • Contributions by 489 speakers
  • Recorded on mobile devices in quiet, indoor environments
  • WAV (PCM) 16 kHz, 16 bits, mono
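If you mix this corpus with audio from other sources, it can help to confirm that every file really matches the format listed above. Here is a minimal sketch using Python's standard `wave` module; the function name and constants are our own, not part of the dataset.

```python
import wave

# Expected format, taken from the feature list above.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits
EXPECTED_CHANNELS = 1            # mono


def matches_expected_format(path: str) -> bool:
    """Return True if the WAV file is 16 kHz, 16-bit, mono PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
```

Calling `matches_expected_format("clip.wav")` (a hypothetical file name) lets you flag clips that need resampling or down-mixing before training.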

Access the dataset here


Wrapping up

To conclude, here are our top picks for the best Arabic Language datasets:

  1. Biggest Arabic Language Dataset – The Massive Arabic Speech Corpus (MASC)
  2. Best Handwritten Arabic Language Dataset – the 16,800-character handwritten Arabic dataset
  3. Best Diverse Arabic Language Dataset – The Arabic Dialect Identification for 17 countries (ADI17) Dataset
  4. V7 Arabic Handwritten Characters Dataset
  5. Best Scripted Arabic Language Dataset – The Arabic Scripted Speech Corpus

We hope that this list has either helped you find a dataset for your project or shown you the range of options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.