Top Russian Language Datasets of 2022

Russian is one of the most widely spoken languages in the world. Even so, it’s not always easy to find Russian language datasets to train your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Russian Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Russian Language datasets:

1. Golos Dataset

Golos is a Russian speech dataset suitable for speech research. The dataset mainly consists of recorded audio files that were manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1,240 hours.

Access the dataset

2. Russian Speech Data by Mobile Phone

The Russian Speech Data by Mobile Phone dataset contains recordings from 1,960 native Russian speakers with authentic accents. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, in-vehicle, and home scenarios. The text was manually proofread for high accuracy. The recordings match mainstream Android and Apple phones.

Access the dataset

3. Russian STT dataset

The Russian STT Dataset contains 16 million utterances and 20,000 hours of audio, totaling 2.3 TB in .wav format (int16), 356G in. The audio is exclusively in Russian.

Access the dataset
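If you want to sanity-check the storage figures quoted for the Russian STT dataset before downloading it, the arithmetic for uncompressed PCM audio is straightforward. The sketch below is only an estimate: the 16 kHz sample rate and mono channel count are assumptions that the dataset description does not state, and `wav_duration_seconds` is a hypothetical helper built on Python's standard `wave` module.

```python
import wave

# Figures quoted in the article.
HOURS = 20_000
BYTES_PER_SAMPLE = 2       # int16 = 2 bytes per sample

# ASSUMPTIONS (not stated in the article): 16 kHz, mono.
SAMPLE_RATE_HZ = 16_000
CHANNELS = 1


def pcm_size_bytes(hours, sample_rate_hz, bytes_per_sample, channels):
    """Size of raw PCM audio in bytes, ignoring per-file WAV headers."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


def wav_duration_seconds(path_or_file):
    """Hypothetical helper: duration of one WAV file, read from its header."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


size_bytes = pcm_size_bytes(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
size_tb = size_bytes / 1e12  # decimal terabytes
print(f"Estimated uncompressed size: {size_tb:.2f} TB")  # prints 2.30 TB
```

Under these assumptions the estimate lands close to the quoted 2,3 TB. Once you have the files, calling `wav_duration_seconds` on a few of them is a quick way to confirm the real sample rate and durations.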

4. CSS10 Dataset

CSS10 is a collection of single-speaker speech datasets for 10 languages. Each dataset consists of short audio clips from LibriVox audiobooks, recorded by a single volunteer, together with their aligned texts. To validate the quality of the data, the authors trained two neural text-to-speech models on each dataset and then ran Mean Opinion Score tests on the synthesized speech samples.

Access the dataset
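For readers new to Mean Opinion Score (MOS) tests: listeners rate each synthesized sample on a scale from 1 (bad) to 5 (excellent), and the MOS is the average of those ratings. Here is a minimal sketch of the calculation. The ratings are made up for illustration and are not results from CSS10, and `mean_opinion_score` is a hypothetical helper name.

```python
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, sample standard deviation) for ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings), stdev(ratings)


# Hypothetical listener ratings for one synthesized sample.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]

mos, spread = mean_opinion_score(ratings)
print(f"MOS: {mos:.2f} (std dev {spread:.2f}, n={len(ratings)})")
```

Reporting the spread and the number of raters alongside the MOS helps when comparing models, since a small listener panel can move the average quite a bit.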

5. Russian LibriSpeech (RuLS) Dataset

The Russian LibriSpeech (RuLS) dataset is based on LibriVox’s public domain audiobooks (see BOOKS.TXT for the list of included books) and contains about 98 hours of audio data.

Access the dataset

6. RuBQ Dataset

Created by Korablinov et al. in 2020, the RuBQ Dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, and a Wikidata sample of triples containing entities with Russian labels. Contains 1,5 in JSON file format.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Russian Language datasets for your projects:

  1. Golos Dataset
  2. Russian Speech Data by Mobile Phone
  3. Russian STT dataset
  4. CSS10 Dataset
  5. Russian LibriSpeech (RuLS) Dataset
  6. RuBQ Dataset

We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.

If there are any datasets you would like us to add to the list, please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.