Bulgarian is a South Slavic language spoken by millions of people, mainly in Bulgaria. That being said, it’s not always easy to find Bulgarian language datasets to train your models.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Bulgarian language datasets.
Are you ready?
Let’s dive in.
Here are our top picks for Bulgarian language datasets:
CC100-Bulgarian Dataset
Created in 2020, the CC100-Bulgarian dataset is one of 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 9.3 GB in size and is exclusively in Bulgarian. It consists of text files.
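A corpus of this size is usually read lazily instead of being loaded into memory all at once. The snippet below is a minimal sketch of that idea. It assumes the corpus is a plain UTF-8 text file with one paragraph per line and a blank line between documents; that layout is an assumption made for illustration and should be checked against the file you actually download. The function name `iter_documents` and the sample text are hypothetical.

```python
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield documents as lists of paragraphs, reading one line at a time."""
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document (assumed layout).
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Tiny in-memory stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO("Първи абзац.\nВтори абзац.\n\nДруг документ.\n")
docs = list(iter_documents(sample))
print(len(docs), [len(d) for d in docs])
```

To use the same function on a downloaded file, pass it an open file handle, for example `open("bg.txt", encoding="utf-8")`, where `bg.txt` is a placeholder path. Because the function only ever holds one document in memory, it should cope with multi-gigabyte files.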
Bulgarian Reading Comprehension Dataset
A dataset containing 2,221 questions from twelfth-grade matriculation exams in various subjects (history, biology, geography, and philosophy), plus 412 additional questions from online history quizzes.
EMP-BTB-CSLI-MWA Dataset
This dataset is based on the test set distributed by the English Resource Grammar, which was translated into Bulgarian by professional translators. The source language is English, and the target language is Bulgarian. It contains 893 sentence pairs.
Wrapping up
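To conclude, here are our top picks for the best Bulgarian language datasets for your projects: the CC100-Bulgarian Dataset, the Bulgarian Reading Comprehension Dataset, and the EMP-BTB-CSLI-MWA Dataset.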
We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.
If there are any datasets you would like us to add to the list, please let us know here.
If you would like to learn more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.