The Best Polish Language Datasets of 2022

Polish is spoken by tens of millions of people worldwide. Even so, it’s not always easy to find Polish language datasets to train your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Polish Language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Polish Language datasets:

1. CC100-Polish Dataset

Created by Conneau & Wenzek in 2020, the CC100-Polish Dataset is one of 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The corpus is 12 GB in size and exclusively in the Polish language. In plain text file format.

Access the dataset

2. NKJP-NER Dataset

Created by Przepiorkowski in 2020, the NKJP-NER Dataset contains extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity. Exclusively in the Polish language. Contains 20 TSV files.

Access the dataset

3. Cyberbullying Detection (CBD) Dataset

Created by Ptaszynski in 2019, the Cyberbullying Detection (CBD) Dataset contains tweets annotated as harmful or non-harmful. Exclusively in the Polish language. In TSV file format.

Access the dataset

4. PolEmo2.0-IN & OUT Dataset

Created by Kocon in 2019, the PolEmo2.0-IN & OUT Dataset contains online reviews from the medicine and hotel domains. The task is to predict the sentiment of a review. Exclusively in the Polish language, containing 8,216 reviews in TSV file format.

Access the dataset

5. Did You Know (DYK) Dataset

Created by Marcinczuk in 2013, the Did You Know (DYK) Dataset contains 4,721 question-answer pairs obtained from the Czy wiesz (Do you know) Wikipedia project. Exclusively in the Polish language. In TSV file format.

Access the dataset

6. Polish Summaries Corpus (PSC) Dataset

Created by Ogrodniczuk in 2014, the Polish Summaries Corpus (PSC) Dataset contains news articles and their summaries. Exclusively in the Polish language. Contains 723 TSV files.

Access the dataset

7. Polish Parliamentary Corpus (PPC) Dataset

Created by Maciej Ogrodniczuk in 2018, the Polish Parliamentary Corpus (PPC) Dataset is a collection of linguistically analyzed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus. Exclusively in the Polish language. Contains 3,000+ XML files.

Access the dataset
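Several of the datasets above are distributed as TSV files. As a rough illustration of how such a file might be read, here is a minimal sketch using only the Python standard library. It assumes a header row and uses the hypothetical column names `text` and `label`, plus a made-up three-row sample. The real datasets may use different columns, so check each download before relying on this.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded TSV file. The column
# names "text" and "label" are assumptions, not the datasets' real schema.
SAMPLE_TSV = (
    "text\tlabel\n"
    "Hotel był czysty i cichy.\tpositive\n"
    "Obsługa była niemiła.\tnegative\n"
    "Lekarz bardzo pomocny.\tpositive\n"
)


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    # QUOTE_NONE stops stray quotation marks in free text from being
    # treated as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def label_counts(rows, label_column="label"):
    """Count how many rows carry each label value."""
    return Counter(row[label_column] for row in rows)


rows = read_tsv(io.StringIO(SAMPLE_TSV))
counts = label_counts(rows)
print(len(rows), dict(counts))
```

For a real file, the same `read_tsv` function should work on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever you saved the download. Opening the file as UTF-8 matters for Polish text, since characters such as ł and ś are easily garbled by other encodings.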

Wrapping up

To conclude, those are our top picks for the best Polish language datasets for your projects.

We hope that this list has either helped you find a dataset for your project or shown you the range of options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to learn more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:
