Russian is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Russian language datasets to train your models.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Russian language datasets.
Are you ready?
Let’s dive in.
Here are our top picks for Russian language datasets:
1. Golos Dataset
Golos is a Russian speech dataset suitable for speech research. It mainly consists of recorded audio files that were manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours.
2. Russian Speech Data by Mobile Phone
The Russian Speech Data by Mobile Phone dataset contains recordings from 1960 native Russian speakers with authentic accents. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, in-vehicle, and home scenarios. The text is manually proofread for high accuracy. The data is matched to mainstream Android and Apple phones.
3. Russian STT dataset
The Russian STT Dataset contains 16 million utterances and 20,000 hours of audio, totalling 2.3 TB (in .wav format in int16), 356G in. The audio is exclusively in Russian and is provided in WAV format.
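As a rough sanity check on the figures above, the uncompressed size of int16 audio can be estimated from its duration. The sketch below assumes 16 kHz mono audio, which is our assumption and is not stated in the dataset description; the function name `pcm_size_bytes` is also purely illustrative. It ignores the small per-file WAV headers.

```python
def pcm_size_bytes(hours, sample_rate_hz=16_000, channels=1, bytes_per_sample=2):
    """Estimate the raw PCM payload size of an audio collection.

    int16 samples take 2 bytes each. The 16 kHz mono defaults are an
    assumption for illustration, not a documented property of the dataset.
    """
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * channels * bytes_per_sample)


size = pcm_size_bytes(20_000)   # 20,000 hours, as quoted above
size_tb = size / 1e12           # decimal terabytes
print(f"{size_tb:.2f} TB")      # prints "2.30 TB"
```

Under these assumptions the estimate lines up with the quoted 2.3 TB, so it is a handy way to budget disk space before downloading a large speech corpus.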
4. CSS10 Dataset
CSS10 is a collection of single-speaker speech datasets for 10 languages. Each dataset consists of short audio clips from LibriVox audiobooks, recorded by a single volunteer, together with their aligned texts. To validate the quality of the data, the authors trained two neural text-to-speech models on each dataset and then ran Mean Opinion Score tests on the synthesized speech samples.
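A Mean Opinion Score is the average of listener ratings, typically given on a 1 to 5 scale. Here is a minimal sketch of the calculation with a normal-approximation 95% confidence interval. The ratings are made-up placeholders, not results from the CSS10 evaluation, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mos = statistics.mean(ratings)
    sd = statistics.stdev(ratings)               # sample standard deviation
    half_width = z * sd / math.sqrt(len(ratings))
    return mos, half_width


# Placeholder ratings from eight hypothetical listeners (1 = bad, 5 = excellent).
ratings = [4, 3, 5, 4, 4, 3, 4, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of listeners the interval is wide, which is why MOS studies usually collect many ratings per sample.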
5. Russian LibriSpeech (RuLS) Dataset
The Russian LibriSpeech (RuLS) dataset is based on LibriVox’s public domain audiobooks (see BOOKS.TXT for the list of included books) and contains about 98 hours of audio data.
6. RuBQ Dataset
Created by Korablinov et al. in 2020, the RuBQ Dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, and a Wikidata sample of triples containing entities with Russian labels. The data is provided in JSON format.
Wrapping up
To conclude, here are our top picks for the best Russian language datasets for your projects:
- Golos Dataset
- Russian Speech Data by Mobile Phone
- Russian STT dataset
- CSS10 Dataset
- Russian LibriSpeech (RuLS) Dataset
- RuBQ Dataset
We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.