Persian is a widely spoken language, yet it’s not always easy to find Persian language datasets to train your models.
That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Persian language datasets.
Are you ready?
Let’s dive in.
Here are our top picks for Persian language datasets:
CC100-Persian Dataset
Created in 2020, the CC100-Persian dataset is one of 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 20 GB in size and exclusively in the Persian language. It contains text files.
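A corpus of this size is usually easier to stream than to load in one go. Here is a minimal sketch of that idea, assuming the text file separates documents with blank lines; the function name `iter_documents` and the file name used below are hypothetical, so check the layout of the file you actually download.

```python
from pathlib import Path
from typing import Iterator


def iter_documents(path: Path) -> Iterator[str]:
    """Yield one document at a time from a large UTF-8 text file.

    Assumption: documents are separated by one or more blank lines,
    and paragraphs inside a document sit on consecutive lines.
    """
    buffer = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                # Non-empty line: still inside the current document.
                buffer.append(line)
            elif buffer:
                # Blank line after content: the document is complete.
                yield "\n".join(buffer)
                buffer = []
    if buffer:
        # The last document may not be followed by a blank line.
        yield "\n".join(buffer)
```

Calling `iter_documents(Path("fa.txt"))` in a `for` loop would then hand you one document per iteration without holding the whole corpus in memory.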
PerKey Dataset
Created by Doostmohammadi in 2020, the PerKey Dataset contains 553K news articles from six Persian news websites and agencies, with author-extracted keyphrases. The data was then filtered and cleaned to achieve higher quality keyphrases. It is exclusively in the Persian language and contains 553,111 JSON files.
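With one JSON file per article, a small loader is often the first thing you need. The sketch below is only an illustration: the field names `body` and `keyphrases` are assumptions, not the dataset's documented schema, so inspect one file and adjust them before use.

```python
import json
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_articles(
    root: Path,
    text_field: str = "body",              # hypothetical field name
    keyphrase_field: str = "keyphrases",   # hypothetical field name
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (article text, keyphrases) pairs from a folder of JSON files."""
    for json_path in sorted(root.glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        # Fall back to empty values when a field is missing.
        text = record.get(text_field, "")
        keyphrases = record.get(keyphrase_field, [])
        yield text, keyphrases
```

Because the function is a generator, it reads one file at a time, which matters when a folder holds several hundred thousand files.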
Perlex Dataset
Created by Asgari-Bidhendi in 2020, the Perlex Dataset is an expert-translated version of the SemEval-2010 Task 8 dataset. It is exclusively in the Persian language and contains 10,717 files.
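SemEval-2010 Task 8 marks the two entities in each sentence with `<e1>` and `<e2>` tags and labels the pair with a relation such as `Cause-Effect(e1,e2)`. Assuming Perlex keeps that layout (worth confirming against the files themselves), a rough parsing sketch could look like this; `RelationExample` and `parse_example` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class RelationExample(NamedTuple):
    entity1: str
    entity2: str
    relation: str
    direction: Optional[str]  # e.g. "e1,e2"; None for undirected labels like "Other"


_E1 = re.compile(r"<e1>(.*?)</e1>")
_E2 = re.compile(r"<e2>(.*?)</e2>")
_LABEL = re.compile(r"^(?P<relation>[\w-]+)(?:\((?P<direction>e[12],e[12])\))?$")


def parse_example(sentence: str, label: str) -> RelationExample:
    """Extract the tagged entities and the relation label from one example."""
    e1 = _E1.search(sentence)
    e2 = _E2.search(sentence)
    parsed = _LABEL.match(label.strip())
    if not (e1 and e2 and parsed):
        raise ValueError("Example does not follow the assumed SemEval-style layout")
    return RelationExample(
        e1.group(1), e2.group(1), parsed.group("relation"), parsed.group("direction")
    )
```

The regular expressions work on Unicode text, so Persian entity spans inside the tags are handled the same way as English ones.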
Persian Language Sentiment Instagram Analysis Dataset
This dataset comes from work on Insta-text, a Persian language sentiment analysis resource built from Instagram comments. In this study, about 111,000 Instagram comments were scraped and about 9,000 of them were labeled through crowdsourcing. A Word2vec model was also used to validate the dataset.
Wrapping up
To conclude, here are our top picks for the best Persian language datasets for your projects:
- CC100-Persian Dataset
- PerKey Dataset
- Perlex Dataset
- Persian Language Sentiment Instagram Analysis Dataset
We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available.
Please let us know if there are any datasets you would like us to add to the list.
If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.