The Best Vietnamese Language Datasets of 2022

Vietnamese is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Vietnamese language datasets to train your models. 

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Vietnamese Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Vietnamese Language datasets:

1. CC100-Vietnamese Dataset

Created by Conneau & Wenzek in 2020, the CC100-Vietnamese dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The Vietnamese corpus is 28 GB in size and is distributed in plain-text format.
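A 28 GB text file is usually too large to load into memory in one go, so it helps to stream it. The snippet below is a minimal sketch, assuming the decompressed file stores one paragraph per line with a blank line between documents; check this against the file you actually download. The function name iter_documents and the file name vi.txt are illustrative and not part of the dataset.

```python
from typing import Iterator, List


def iter_documents(path: str) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraph strings.

    Assumes one paragraph per line and a blank line between documents.
    """
    paragraphs: List[str] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.strip():
                paragraphs.append(line)
            elif paragraphs:
                # Blank line: the current document is complete.
                yield paragraphs
                paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs
```

Because the function is a generator, a loop such as `for doc in iter_documents("vi.txt"):` only ever holds a single document in memory.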

Access the dataset

2. Vietnamese Multiple-choice Machine Reading Comprehension Corpus (ViMMRC) Dataset

Created by Nguyen in 2020, the Vietnamese Multiple-choice Machine Reading Comprehension Corpus (ViMMRC) Dataset contains 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts used to teach reading comprehension to 1st to 5th graders. The corpus is in Vietnamese and comprises 417 files. Obtaining it requires contacting the author.

Access the dataset

3. Vietnamese Social Media Emotion Corpus (UIT-VSMEC) Dataset

Created by Ho in 2019, the Vietnamese Social Media Emotion Corpus (UIT-VSMEC) Dataset contains 6,927 human-annotated sentences with six emotion labels, supporting emotion recognition research in the Vietnamese language. The data is provided in Excel format.

Access the dataset

4. Vietnamese Image Captioning Dataset (UIT-ViIC) Dataset

This dataset extends image captioning resources, which are mostly in English, to another language. The UIT-ViIC dataset is annotated manually in Vietnamese using images from the MS-COCO dataset, and it comes with a web-based annotation tool designed to improve annotators’ performance. In its current scope, UIT-ViIC consists of 19,250 captions for 3,850 images of sports played with balls.

Access the dataset

5. UIT-SPC Dataset

The UIT-SPC corpus contains 1,565 papers from top NLP/CL conferences such as ACL (2014, 2015, and 2016), CoNLL 2015, EACL 2014, NAACL 2015, and EMNLP 2015. The papers are first pre-processed to remove unnecessary information (e.g., formulas and tables). They are then formatted as .xml files that include the paper title, sections, and sub-sections according to each paper’s structure. The corpus is exclusively in the Vietnamese language.

Please contact Mr. Thin Dang by email (thindv@uit.edu.vn) to sign the corpus user agreement and receive the corpus.

6. UIT-ViNames Dataset

This dataset comprises over 26,000 full names annotated with genders and is available on the authors’ website for research purposes. The accompanying paper also describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest, and Logistic Regression) and a deep learning model (LSTM) with fastText word embeddings for gender prediction on Vietnamese names.
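To give a feel for the simplest of these approaches, here is a rough sketch of a Multinomial Naive Bayes classifier over name tokens, written with only the Python standard library. It is not the authors’ implementation: the token-level features, the "male"/"female" label strings, and the four training names are all assumptions made up for illustration, and none of the names come from UIT-ViNames.

```python
import math
import unicodedata
from collections import Counter, defaultdict
from typing import DefaultDict, List, Set, Tuple

Model = Tuple[Counter, DefaultDict[str, Counter], Set[str]]


def tokens(name: str) -> List[str]:
    # Normalize to NFC so composed and decomposed diacritics compare equal.
    return unicodedata.normalize("NFC", name).lower().split()


def train(samples: List[Tuple[str, str]]) -> Model:
    """Count how often each token appears under each label."""
    label_counts: Counter = Counter()
    token_counts: DefaultDict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for name, label in samples:
        label_counts[label] += 1
        for tok in tokens(name):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, token_counts, vocab


def predict(model: Model, name: str) -> str:
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens(name):
            score += math.log((token_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy examples, NOT taken from UIT-ViNames.
toy_samples = [
    ("Nguyễn Thị Lan", "female"),
    ("Trần Thị Hoa", "female"),
    ("Lê Văn Nam", "male"),
    ("Phạm Văn Hùng", "male"),
]
model = train(toy_samples)
print(predict(model, "Hoàng Thị Mai"))
```

With real data, you would replace toy_samples with the (name, gender) pairs from the dataset and hold out a test split to measure accuracy.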

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Vietnamese Language datasets for your projects:

  1. CC100-Vietnamese Dataset
  2. Vietnamese Multiple-choice Machine Reading Comprehension Corpus (ViMMRC) Dataset
  3. Vietnamese Social Media Emotion Corpus (UIT-VSMEC) Dataset
  4. Vietnamese Image Captioning Dataset (UIT-ViIC) Dataset
  5. UIT-SPC Dataset
  6. UIT-ViNames Dataset

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to learn more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives.
