Best Indonesian Language Datasets of 2022

Indonesian is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Indonesian language datasets with a specific dialect or type of speech to train your models. 

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Indonesian Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Indonesian Language datasets:

1. CASA (IndoNLU) Dataset

Created by Ilmania in 2018, the CASA (IndoNLU) Dataset is an aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms.

The task is defined as a multi-label classification task, where each label represents a sentiment for a single aspect with three possible values: positive, negative, and neutral. Exclusively in the Indonesian language, containing 1,08 in CSV file format.
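As a rough illustration of that multi-label setup, the sketch below turns each review's per-aspect sentiments into a vector of integers. The column names (sentence, fuel, machine, price), the label-to-integer mapping, and the two sample rows are assumptions made up for this example; check the actual CSV header before relying on them.

```python
import csv
import io

# Hypothetical CSV layout: one text column plus one column per aspect.
# Column names and rows are illustrative only, not copied from CASA.
SAMPLE_CSV = """sentence,fuel,machine,price
contoh ulasan pertama,positive,neutral,negative
contoh ulasan kedua,neutral,neutral,positive
"""

# Arbitrary integer encoding for the three sentiment values.
SENTIMENT_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def encode_rows(csv_text, text_column="sentence"):
    """Return the aspect names and a list of (text, label_vector) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column except the text column is treated as an aspect.
    aspects = [name for name in reader.fieldnames if name != text_column]
    encoded = []
    for row in reader:
        labels = [SENTIMENT_TO_ID[row[aspect]] for aspect in aspects]
        encoded.append((row[text_column], labels))
    return aspects, encoded


aspects, encoded = encode_rows(SAMPLE_CSV)
```

Each review ends up with one label per aspect, which is what distinguishes this task from single-label sentiment classification.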

Access the dataset

2. The Wiki Revision Edits Textual Entailment (WReTE) (IndoNLU) Dataset

Created by Setya and Mahendra in 2018, the Wiki Revision Edits Textual Entailment (WReTE) (IndoNLU) Dataset consists of 450 sentence pairs constructed from Wikipedia revision history. Each pair comes with a binary semantic relation: the pair is labeled as entailed when the meaning of the second sentence can be derived from the first one, and not entailed otherwise. The dataset is exclusively in the Indonesian language and contains 450 sentence pairs in CSV file format.

Access the dataset

3. POSP (IndoNLU) Dataset

Created by Hoesen and Purwarianti in 2018, the POSP (IndoNLU) Dataset is collected from Indonesian news websites. The dataset consists of around 8,000 sentences with 26 POS tags. Exclusively in the Indonesian language, containing 84 in Text file format.

Access the dataset

4. SmSA (IndoNLU) Dataset

Created by Purwarianti and Crisdayanti in 2019, the SmSA (IndoNLU) Dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments: positive, negative, and neutral. Exclusively in the Indonesian language, containing 12,76 in TSV file format.
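Before training on a sentiment dataset like this one, it can help to check how the three labels are distributed. The sketch below assumes a TSV file with two tab-separated columns (text, then label) and no header row; that layout and the sample rows are assumptions for illustration, not a description of the actual SmSA files.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed "text<TAB>label" layout.
SAMPLE_TSV = (
    "teks contoh satu\tpositive\n"
    "teks contoh dua\tnegative\n"
    "teks contoh tiga\tpositive\n"
)


def label_distribution(tsv_text):
    """Count how often each sentiment label appears in the TSV text."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(label for _text, label in reader)


distribution = label_distribution(SAMPLE_TSV)
```

A heavily skewed count is a hint that class weighting or resampling may be worth considering.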

Access the dataset

5. KEPS (IndoNLU) Dataset

Created by Mahfuzh in 2019, the KEPS (IndoNLU) Dataset consists of text from Twitter discussing banking products and services. A phrase containing important information is considered a keyphrase.

A text may contain one or more keyphrases, since important phrases can be located at different positions. The dataset follows the IOB chunking format, which marks the position of each keyphrase. Exclusively in the Indonesian language, containing 1,247 in Text file format.
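To make the IOB idea concrete, here is a minimal sketch that rebuilds keyphrases from a tagged token sequence. It assumes plain B, I, and O tags (begin, inside, outside); the function name extract_keyphrases and the sample tokens are hypothetical, and the real files may use a different tag spelling or layout.

```python
def extract_keyphrases(tokens, tags):
    """Group tokens tagged B/I into phrases; an O tag closes the open phrase."""
    phrases = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; flush the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the phrase that is currently open.
            current.append(token)
        else:
            # "O" closes any open phrase; a stray "I" with no open phrase is ignored.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Made-up example, not taken from the dataset.
tokens = ["layanan", "mobile", "banking", "sering", "error", "saat", "transfer"]
tags = ["O", "B", "I", "O", "B", "O", "O"]
keyphrases = extract_keyphrases(tokens, tags)
```

Because every token carries its own tag, a single text can yield several keyphrases at different positions, as in the two-phrase example above.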

Access the dataset

6. CC100-Indonesian Dataset

Created by Conneau & Wenzek in 2020, the CC100-Indonesian Dataset is one of 100 corpora of monolingual data processed from the January-December 2018 Commoncrawl snapshots using the CC-Net repository. The size of this corpus is 36G. Exclusively in the Indonesian language, in Text file format.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Indonesian Language datasets for your projects:

  1. CASA (IndoNLU) Dataset
  2. The Wiki Revision Edits Textual Entailment (WReTE) (IndoNLU) Dataset
  3. POSP (IndoNLU) Dataset
  4. SmSA (IndoNLU) Dataset
  5. KEPS (IndoNLU) Dataset
  6. CC100-Indonesian Dataset

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.