The Best Greek Language Datasets of 2022

Greek is spoken by roughly 13 million people worldwide. Even so, it’s not always easy to find Greek language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Greek Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Greek Language datasets:

CC100-Greek Dataset

Created in 2020, the CC100-Greek dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 7.4 GB in size and is exclusively in Greek. It contains text files.

Access the dataset
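
If you want to peek inside a corpus this size without loading all 7.4 GB into memory, streaming it is the way to go. Here’s a minimal sketch, assuming the file is plain UTF-8 text where each line is a paragraph and a blank line separates documents. That layout is our assumption, so check it against the file you actually download. The function name `iter_documents` and the sample text are ours, purely for illustration.

```python
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraph strings.

    Assumes one paragraph per line and a blank line between documents.
    """
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Tiny in-memory stand-in for the real file (illustrative text only).
sample = io.StringIO(
    "Καλημέρα κόσμε.\nΔεύτερη παράγραφος.\n\nΆλλο έγγραφο.\n"
)
docs = list(iter_documents(sample))
print(len(docs), "documents")
```

With the real corpus, you would pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")`, where `path` points to wherever you saved the download. Because the function is a generator, only one document sits in memory at a time.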

OGTD (Offensive Greek Tweet Dataset)

OGTD is the first collection of Greek tweets containing profanity. Its author created it for an MA thesis in Computational Linguistics at the University of Wolverhampton, titled “Detecting Offensive Posts in Greek Social Media”. The dataset is manually annotated and contains 4,779 posts from Twitter, each labeled as offensive or not offensive. It is exclusively in Greek and contains text files.

Access the dataset
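
Before training a classifier on a two-class dataset like this one, it’s worth checking how balanced the labels are. Here’s a rough sketch of that check. It assumes a tab-separated file with a header row; the column names `text` and `label` and the label values `OFF` and `NOT` are hypothetical placeholders, so swap in whatever the downloaded files actually use.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable


def label_counts(rows: Iterable[Dict[str, str]], label_field: str = "label") -> Counter:
    """Count how many rows carry each label value."""
    return Counter(row[label_field].strip() for row in rows)


# Tiny in-memory stand-in for the real file (hypothetical columns and labels).
sample_tsv = (
    "text\tlabel\n"
    "παράδειγμα ένα\tNOT\n"
    "παράδειγμα δύο\tOFF\n"
    "παράδειγμα τρία\tNOT\n"
)
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
counts = label_counts(reader)

total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

If one class turns out to be much smaller than the other, that’s a hint to look at class weighting or a metric like macro F1 rather than plain accuracy.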

AESI (Athens Emotional States Inventory) Dataset

The AESI dataset consists of sentences whose content is indicative of a corresponding emotion. The emotional content was assessed through a survey of 40 young participants, using a questionnaire that followed a Latin square design. The sentences that were correctly identified by 85% of the participants were then recorded in a soundproof room with microphones and cameras. The resulting database contains 696 utterances in Greek, recorded by 20 native speakers, with a total duration of approximately 28 minutes. It contains audio files.

Access the dataset

The Greek Sign Language (GSL) Dataset

This dataset contains 10,290 sentence instances, 40,785 gloss instances, 310 unique glosses (vocabulary size), and 331 unique sentences, with 4.23 glosses per sentence on average. Each signer performs a set of pre-defined dialogues five consecutive times. In every dialogue, the scenario simulates a deaf person communicating with a single public service employee.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Greek language datasets for your projects:

  1. CC100-Greek Dataset
  2. OGTD (Offensive Greek Tweet Dataset)
  3. AESI (Athens Emotional States Inventory) Dataset
  4. The Greek Sign Language (GSL) Dataset

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available.

Please let us know if there are any datasets you would like us to add to the list.

If you want to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.