6 Best German Language Datasets of 2022

German is one of the most widely spoken languages in the world. Even so, it’s not always easy to find German language datasets that cover a specific dialect or type of speech for training your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best German Language datasets.

Are you ready?

Let’s dive into our list of the best German Language datasets in 2022.


Here are our top picks for German Language datasets:

1. Multi30k Dataset

Created by Elliott et al. in 2016, the Multi30k Dataset consists of images paired with sentences in English and German. It extends the Flickr30K dataset. The dataset is in German and English and contains 31,014 entries; no file format is listed.

Access the dataset

2. WebNLG (Enriched) Dataset

Created by Gardent et al. in 2017, the WebNLG (Enriched) Dataset consists of 25,298 (data, text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBpedia, and the texts are sequences of one or more sentences verbalizing these data units. The dataset is in German and English and is provided in XML file format.

Access the dataset

3. Ten Thousand German News Articles Dataset (10kGNAD)

Created by Timo Block in 2019, the Ten Thousand German News Articles Dataset (10kGNAD) consists of 10,273 German language news articles from an Austrian online newspaper, categorized into nine topics. The articles are provided in CSV file format.

Access the dataset
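Before training a topic classifier on a labeled news corpus like this one, it helps to check how the articles are spread across the categories. Here is a minimal Python sketch of that check. It assumes each row holds a topic label followed by the article text; the delimiter, quote character, and column order are assumptions to verify against the file you download, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text, delimiter=";", quotechar="'"):
    """Count how many articles fall under each topic label.

    Assumes each row is `label<delimiter>article text`. The delimiter,
    quote character, and column order are assumptions, not confirmed
    properties of the real file.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter, quotechar=quotechar)
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        counts[row[0].strip()] += 1
    return counts


# Made-up sample rows, only to show the expected shape of the input.
sample = (
    "Sport;'Der Verein gewinnt das Spiel; die Fans feiern.'\n"
    "Wirtschaft;Die Inflation steigt weiter.\n"
    "Sport;Das Finale findet am Sonntag statt.\n"
)

for label, n in count_labels(sample).most_common():
    print(label, n)
```

For a real file, read its contents with open(path, encoding="utf-8").read() and pass the text to count_labels. A heavily skewed count is a hint that you may want stratified train/test splits.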

4. CC100-German Dataset

Created by Conneau & Wenzek et al. in 2020, the CC100-German Dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The corpus is 18G in size, in the German language, and is provided in text file format; no entry count is listed.

Access the dataset

5. GermEval 2014 NER Shared Task Dataset

Created by Benikova et al. in 2014, the GermEval 2014 NER Shared Task Dataset was sampled from German Wikipedia and news corpora as a collection of citations. It covers over 31,000 sentences corresponding to over 590,000 tokens in the German language and is provided in TSV file format.

Access the dataset
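NER corpora distributed as TSV usually store one token per line, so the first step is to group those lines back into sentences. The sketch below shows one way to do that in Python. The layout it expects (index, token, and tag columns separated by tabs, a blank line between sentences, and comment lines starting with "#") is an assumption to check against the actual files, and the sample text is made up.

```python
def read_sentences(tsv_text):
    """Group token-per-line TSV rows into sentences of (token, tag) pairs.

    Assumed layout (verify against the real files): each token line is
    `index<TAB>token<TAB>tag[<TAB>more tags]`, sentences are separated by
    a blank line, and lines starting with '#' are comments.
    """
    sentences, current = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment / metadata lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not match the assumed layout
        current.append((fields[1], fields[2]))
    if current:
        sentences.append(current)  # last sentence may lack a trailing blank line
    return sentences


# Made-up sample in the assumed layout.
sample = (
    "# made-up example\n"
    "1\tAnna\tB-PER\tO\n"
    "2\tSchmidt\tI-PER\tO\n"
    "3\tspricht\tO\tO\n"
    "\n"
    "1\tBerlin\tB-LOC\tO\n"
    "2\tfeiert\tO\tO\n"
)

parsed = read_sentences(sample)
print(len(parsed), "sentences,", sum(len(s) for s in parsed), "tokens")
```

Running the same count over the full files is a quick way to confirm that your copy matches the sentence and token totals quoted above.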

6. Event-focused Emotion Corpora for German and English Dataset

Created by Troiano et al. in 2019, the Event-focused Emotion Corpora for German and English Dataset provides German and English corpora for emotion classification, annotated through crowdsourcing in the style of the ISEAR resources. The dataset is in German and English and contains 2,002 entries in TSV file format.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best German Language datasets for your projects:

  1. Multi30k Dataset
  2. WebNLG (Enriched) Dataset
  3. Ten Thousand German News Articles Dataset (10kGNAD)
  4. CC100-German Dataset
  5. GermEval 2014 NER Shared Task Dataset
  6. Event-focused Emotion Corpora for German and English Dataset

We hope that this list has either helped you find a dataset for your project or shown you the range of options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.
