The Best Dutch Language Datasets of 2022

Dutch is spoken by more than 20 million people. That being said, it’s not always easy to find Dutch language datasets to train your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Dutch Language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Dutch Language datasets:

1. CC100-Dutch Dataset

This dataset is one of 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots using the CC-Net repository. The corpus is 7.9G in size, is exclusively in the Dutch language, and contains text files.
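
A corpus of this size is usually read as a stream rather than loaded into memory at once. The sketch below shows one way to do that in Python. It assumes that documents are separated by blank lines and that paragraphs within a document sit on consecutive lines; that layout is an assumption to check against the file you actually download. The function name `iter_documents` and the sample text are made up for illustration.

```python
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraph strings.

    Assumption: a blank line marks the boundary between two documents.
    """
    doc: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            # Non-empty line: one more paragraph of the current document.
            doc.append(line)
        elif doc:
            # Blank line: the current document is complete.
            yield doc
            doc = []
    if doc:
        # Last document when the input does not end with a blank line.
        yield doc


# Made-up sample standing in for the real corpus file.
sample = io.StringIO("Eerste alinea.\nTweede alinea.\n\nNieuw document.\n")
docs = list(iter_documents(sample))
print(len(docs), "documents")
```

For the real corpus, pass an open file handle (for example `open(path, encoding="utf-8")`, where `path` points at your local copy) in place of the `io.StringIO` sample, and iterate over the generator instead of calling `list` on it.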

Access the dataset

2. Dutch Book Reviews Dataset

Created by van der Burgh in 2019, the Dutch Book Reviews Dataset contains book reviews with associated binary sentiment polarity labels, exclusively in the Dutch language. It contains 118,516 text files.

Access the dataset

3. Personae Corpus Dataset

Created by Luyckx in 2008, the Personae Corpus was collected for experiments in authorship attribution and personality prediction. It consists of 145 essays, exclusively in the Dutch language, stored as 145 text files.

Access the dataset

4. Conference on Computational Natural Language Learning Dataset

Created by Tjong Kim Sang in 2002, the Conference on Computational Natural Language Learning (CoNLL 2002) dataset is a collection of newswire articles annotated for named entity recognition. The Spanish portion was made available by the EFE News Agency; the Dutch data consist of four editions of the Belgian newspaper “De Morgen” from 2000. The annotations use the IOB2 format, and the Dutch portion is exclusively in the Dutch language. Contains HTML files.
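
For readers new to the IOB2 format mentioned above: each token carries a tag, where `B-X` opens an entity of type `X`, `I-X` continues it, and `O` marks a token outside any entity. The Python sketch below reads such data and groups the tags into entity spans. It assumes one token per line with whitespace-separated columns, the token first and the tag last, and blank lines between sentences; check this against the files you download. The function names and the sample sentence are made up for illustration and are not taken from the dataset.

```python
from typing import Iterable, Iterator, List, Tuple

Span = Tuple[str, int, int, str]  # (entity type, start, end exclusive, text)


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) for each blank-line-separated sentence."""
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])   # assumption: token is the first column
        tags.append(cols[-1])    # assumption: IOB2 tag is the last column
    if tokens:
        yield tokens, tags


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Group IOB2 tags into entity spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            # Current entity ends just before token i.
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # B- opens an entity; a stray I- is leniently treated the same way.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags), " ".join(tokens[start:])))
    return spans


# Made-up sample sentence in the assumed column layout (token, POS, tag).
SAMPLE = """Jan N B-PER
Peeters N I-PER
woont V O
in Prep O
Gent N B-LOC
. Punc O
"""

sentences = list(read_sentences(SAMPLE.splitlines()))
spans = iob2_to_spans(*sentences[0])
print(spans)
```

Because every entity starts with an explicit `B-` tag, two adjacent entities of the same type stay separate, which is the main practical difference from the older IOB1 scheme.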

Access the dataset

5. SICK-NL Dataset

The SICK-NL Dataset targets Natural Language Inference in Dutch. SICK-NL was obtained by translating the SICK dataset of Marelli et al. (2014) from English into Dutch. Both monolingual and multilingual NLP models are compared, for English and Dutch, on the two tasks. The dataset is exclusively in the Dutch language and contains text files.

Access the dataset

6. CELEX Dataset

The CELEX database comprises three searchable lexical databases: Dutch, English, and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class), and word frequency. Contains HTML/text files.

Access the dataset

7. DpgMedia2019

DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labeled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labeled on the article level.

Access the dataset

Wrapping up

To conclude, here are our top picks for the best Dutch Language datasets for your projects:

  1. CC100-Dutch Dataset
  2. Dutch Book Reviews Dataset
  3. Personae Corpus Dataset
  4. Conference on Computational Natural Language Learning Dataset
  5. SICK-NL Dataset
  6. CELEX Dataset
  7. DpgMedia2019

We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.

If there are any datasets you would like us to add to the list, please let us know here.

If you would like to learn more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.
