The Best Ukrainian Language Datasets of 2022

Ukrainian is spoken by tens of millions of people. Even so, it’s not always easy to find Ukrainian language datasets to train your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Ukrainian Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Ukrainian Language datasets:

CC100-Ukrainian Dataset

Created in 2020, CC100-Ukrainian is one of 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 14 GB in size and is exclusively in the Ukrainian language. The files are in plain-text format.
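Because the corpus is a large plain-text file, you will usually want to stream it rather than load it into memory. Here is a minimal Python sketch of that idea. It assumes documents are separated by a blank line and each non-blank line is one paragraph, so check this against the files you actually download. The function name iter_documents and the sample text are ours, for illustration only.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraph strings.

    Assumption: a blank line separates documents and every non-blank
    line is one paragraph. Verify this against the downloaded files.
    """
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Small in-memory sample standing in for the real file (illustrative text).
sample = ["Перший абзац.\n", "Другий абзац.\n", "\n", "Новий документ.\n"]
docs = list(iter_documents(sample))
print(len(docs), "documents")
```

To run this on the real corpus, pass an open file handle instead of the sample list. Opening the file with encoding="utf-8" and looping over iter_documents(f) keeps only one document in memory at a time.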

Access the dataset

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Created in 2021, the UA-GEC dataset brings together passages of text, split into sentences, for grammatical error correction and fluency. It contains 20,715 sentences and 328,779 tokens from 492 authors, exclusively in the Ukrainian language. The files are in plain-text format.

Access the dataset

Ukraine: Languages Dataset

Language data is drawn from the 2001 government census. It includes the percentage of the population for whom each language is their main language, English-speaking rates, and literacy rates of people aged 15 and older. All data is drawn from government survey results and is subject to any associated limitations or distortions present in the source data.

Access the dataset

Universal Dependencies Dataset

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. Version 2.7, released in 2020, consists of 183 treebanks covering 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads, and universal dependency labels.
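UD treebanks are distributed in the CoNLL-U format, where each token is one line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Here is a rough Python sketch of how the annotation layers listed above can be read from one sentence. The Token class and helper names are ours, and the sample sentence is hand-written for illustration, not copied from a treebank.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str              # universal part-of-speech tag
    xpos: str              # language-specific tag ("_" when absent)
    feats: Dict[str, str]  # universal morphological features
    head: int              # ID of the syntactic head, 0 for the root
    deprel: str            # universal dependency label


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_sentence(block: str) -> List[Token]:
    """Parse the token lines of a single CoNLL-U sentence block."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append(Token(cols[1], cols[2], cols[3], cols[4],
                            parse_feats(cols[5]), int(cols[6]), cols[7]))
    return tokens


# Hand-written illustrative sentence: "Я читаю книгу." (I am reading a book.)
rows = [
    ["1", "Я", "я", "PRON", "_", "Case=Nom|Number=Sing|Person=1", "2", "nsubj", "_", "_"],
    ["2", "читаю", "читати", "VERB", "_", "Number=Sing|Person=1|Tense=Pres", "0", "root", "_", "_"],
    ["3", "книгу", "книга", "NOUN", "_", "Case=Acc|Number=Sing", "2", "obj", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
sample = "# text = Я читаю книгу.\n" + "\n".join("\t".join(r) for r in rows)

tokens = parse_sentence(sample)
for tok in tokens:
    print(tok.form, tok.upos, tok.deprel, tok.head)
```

For real projects, an existing CoNLL-U reader is likely a safer choice than hand-rolled parsing. The sketch is only meant to show where each annotation layer lives in the file.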

Access the dataset

Ukrainian Descriptions Of Words

This dataset contains descriptions of 15 unique words in simple terms by 8 different people. You can use the first 4 users (values in the column user 1-4) for model training, and users 5-8 for validation. Exclusively in the Ukrainian language.
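The suggested split can be sketched in a few lines of Python. The sketch assumes each row holds a target word plus one description per user, in columns named "word" and "user 1" to "user 8". Those column names are our guess from the description above, and the rows below are made-up placeholders, so adjust both to match the real file.

```python
from typing import Dict, List, Tuple

# Assumed column names; check them against the actual file header.
TRAIN_USERS = [f"user {i}" for i in range(1, 5)]   # users 1-4
VALID_USERS = [f"user {i}" for i in range(5, 9)]   # users 5-8


def to_pairs(rows: List[Dict[str, str]], users: List[str]) -> List[Tuple[str, str]]:
    """Build (description, word) pairs from the chosen user columns."""
    return [
        (row[user], row["word"])
        for row in rows
        for user in users
        if row.get(user)  # skip missing or empty descriptions
    ]


# Placeholder rows standing in for the real data (not actual dataset content).
rows = [
    {"word": w, **{f"user {i}": f"{w}: опис {i}" for i in range(1, 9)}}
    for w in ("кіт", "дім")
]

train = to_pairs(rows, TRAIN_USERS)
valid = to_pairs(rows, VALID_USERS)
print(len(train), "training pairs,", len(valid), "validation pairs")
```

Splitting by user rather than at random means the validation set only contains descriptions from people the model has not seen during training.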

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Ukrainian language datasets for your projects:

  1. CC100-Ukrainian Dataset
  2. UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
  3. Ukraine: Languages Dataset
  4. Universal Dependencies Dataset
  5. Ukrainian Descriptions Of Words

We hope that this list has either helped you find a dataset for your project or shown you the range of options available to you.

If there are any datasets you would like us to add to the list, please let us know here.

If you would like to learn more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.
