The Best Czech Language Datasets of 2022

Czech is spoken by more than 10 million people. That being said, it’s not always easy to find Czech language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Czech Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Czech Language datasets:

CC100-Czech Dataset

Created in 2020, the CC100-Czech dataset is one of the 100 monolingual corpora built by processing the January-December 2018 Commoncrawl snapshots with the CC-Net repository. The corpus is 4.4 GB in size, is exclusively in the Czech language, and consists of text files.

Access the dataset
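
If you want to work with a large monolingual text corpus like CC100-Czech without loading all of it into memory, a small streaming reader can help. The sketch below is a minimal example and makes one assumption that you should verify against the files you download: documents are separated by blank lines, with one paragraph per line. The sample text is a tiny in-memory stand-in, not real corpus data.

```python
import io
from typing import Iterator, List, TextIO


def iter_documents(lines: TextIO) -> Iterator[str]:
    """Yield one document at a time from a text stream.

    Assumption: documents are separated by blank lines and each
    non-empty line is one paragraph of the current document.
    """
    paragraphs: List[str] = []
    for line in lines:
        line = line.strip()
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield "\n".join(paragraphs)
            paragraphs = []
    if paragraphs:
        # Emit the last document if the stream does not end with a blank line.
        yield "\n".join(paragraphs)


# Tiny in-memory stand-in for a corpus file (illustrative text only).
sample = io.StringIO("Dobrý den.\nJak se máte?\n\nPraha je hlavní město.\n")
docs = list(iter_documents(sample))
```

For a real file, you could pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")`, or `lzma.open(path, "rt", encoding="utf-8")` if your download happens to be xz-compressed. Here `path` is a placeholder for wherever you saved the corpus.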

CWMT 2016 News Dataset

This dataset is a collection of parallel corpora consisting of about 1500 English sentences translated into six languages (Czech, German, Finnish, Romanian, Russian, Turkish) and an additional 1500 sentences from each of those languages translated into English. The training data consists of parallel corpora to train translation models, monolingual corpora to train language models, and development sets for tuning.

Access the dataset
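
Parallel corpora like the ones described above are often handled as two files where line N in one language corresponds to line N in the other. The sketch below pairs sentences under that assumption. The line-aligned layout is an assumption for illustration, not a confirmed description of how this dataset is distributed, and the sample sentences are made up.

```python
import io
from itertools import zip_longest
from typing import Iterator, TextIO, Tuple


def iter_sentence_pairs(src: TextIO, tgt: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Assumption: line N of `src` is the translation of line N of `tgt`.
    Raises ValueError if one stream has more lines than the other.
    """
    for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
        if s is None or t is None:
            raise ValueError(f"streams differ in length at line {line_no}")
        s, t = s.strip(), t.strip()
        if s and t:
            # Skip pairs where either side is empty.
            yield s, t


# Tiny in-memory stand-ins for an English file and a Czech file.
english = io.StringIO("Good morning.\nThank you.\n")
czech = io.StringIO("Dobré ráno.\nDěkuji.\n")
pairs = list(iter_sentence_pairs(english, czech))
```

Failing loudly on a length mismatch is a deliberate choice here: a silent misalignment would pair every later sentence with the wrong translation.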

AKCES-GEC

AKCES-GEC is a dataset for grammatical error correction in Czech. It is exclusively in the Czech language and contains 47371 sentences and 505275 words across 11 text files. It comes from the LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Access the dataset

Czech Restaurant Information Dataset

The Czech Restaurant Information dataset is built for natural language generation (NLG) in task-oriented spoken dialogue systems, with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015). It includes input dialogue acts and the corresponding output natural language paraphrases in Czech, stored as text files.

Access the dataset

Czech Subjectivity Dataset

The Czech Subjectivity dataset contains 10k manually annotated subjective and objective sentences from movie reviews and descriptions. It is exclusively in the Czech language and consists of text files.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Czech language datasets for your projects:

  1. CC100-Czech Dataset
  2. CWMT 2016 News Dataset
  3. AKCES-GEC
  4. Czech Restaurant Information Dataset
  5. Czech Subjectivity Dataset

We hope that this list has helped you find a dataset for your project or realize the myriad options available.

Please let us know if there are any datasets you would like us to add to the list.

If you want to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math: check our AI dataset project calculator.


Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.