At Twine, we specialize in helping companies create high-quality custom audio and video datasets.
We often get asked whether there are any off-the-shelf audio and video datasets we would recommend, both for testing and as a basis for custom approaches.
We’ve ransacked the web to find only the top French Language datasets, so you don’t have to.
Are you ready? Let’s dive into our list of the best French Language datasets.
Here are our top picks for French Language Datasets:
1. Biggest Non-Commercial French Language Dataset
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple speaking styles and emphasis. Texts from various sources, such as parliamentary debates and novels, were read by a professional French voice talent. A subset of the database contains emphasized words in many different contexts.
Features:
- 9750 utterances from various sources
- more than ten hours of speech data
- freely available
Not quite your style? Check out these alternatives:
- CC100-French is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The French corpus is 14G in size.
- The Translation-Augmented-LibriSpeech-Corpus (Libri-Trans) is an augmentation of LibriSpeech ASR that contains English utterances (from audiobooks) automatically aligned with French text. It offers 236 hours of speech aligned to translated text, provided as text and WAV files.
- BREF-120 consists of about 50-60 sentences per speaker, with recordings made using only a Shure microphone. In BREF-80, the sentences were chosen to cover as many prompts as possible.
2. Best Child Adult Interaction French Language Dataset
The project “Treatment of Oral Corpus in French” (TCOF) was born from the desire to preserve oral corpora collected in the 80s and 90s for personal research purposes.
Features:
- 626 transcriptions (Transcriber and WAV), with a total duration of 146 hours for 1,542,562 words
3. Best Canadian French Language Dataset
The Canadian French Emotional (CaFE) speech dataset contains six different sentences, pronounced by six male and six female actors, in six basic emotions plus one neutral emotion. The six basic emotions are acted in two different intensities.
Features:
- The audio is digitally recorded at high-resolution (192 kHz sampling rate, 24 bits per sample).
- Freely available under a Creative Commons license (CC BY-NC-SA 4.0).
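To get a feel for what a 192 kHz, 24-bit recording means in practice, here is a rough sketch of the uncompressed PCM data rate. The function name is our own, and we assume a single (mono) channel, which the dataset description above does not state. For comparison, it also computes the rate for 16 kHz, 16-bit mono audio.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM data rate in bytes per second.

    channels=1 is an assumption; adjust it if the audio is stereo.
    """
    return sample_rate_hz * bits_per_sample // 8 * channels


# 192 kHz, 24 bits per sample (mono assumed)
high_res = pcm_bytes_per_second(192_000, 24)
# 16 kHz, 16 bits per sample, mono
speech_16k = pcm_bytes_per_second(16_000, 16)

print(high_res, speech_16k, high_res / speech_16k)  # 576000 32000 18.0
```

Under these assumptions, each second of high-resolution audio takes 18 times the space of 16 kHz, 16-bit audio, so many speech pipelines would downsample it before training.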
Alternatives:
- The CALLFRIEND Canadian French dataset consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
4. Best French Native Reading Comprehension dataset
FQuAD is a French Native Reading Comprehension dataset that consists of 25,000+ questions created by higher education students on a set of Wikipedia articles.
Features:
- Over 120 articles
5. Best Scripted French Language Dataset
The French Scripted Speech Corpus dataset consists of 325 hours of transcribed French scripted speech focusing on daily-use sentences, news, command and query, and keyword spotting.
Features:
- Contributions by 489 speakers
- Recorded on mobile devices in quiet, indoor environments
- WAV (PCM) 16 kHz, 16 bits, mono
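If you want to confirm that files you receive really match a stated format such as 16 kHz, 16 bits, mono, a minimal sketch using Python's standard `wave` module could look like this. The helper names (`wav_format`, `matches_spec`, `write_silence`) are our own, and `write_silence` exists only to create a test file so that the example runs without any dataset download.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, bits_per_sample, channels) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_spec(path, rate=16_000, bits=16, channels=1):
    """Check a WAV file against an expected format (default: 16 kHz, 16-bit, mono)."""
    return wav_format(path) == (rate, bits, channels)


def write_silence(path, seconds=1, rate=16_000, bits=16, channels=1):
    """Write a silent PCM WAV file, used here only as a stand-in for real data."""
    bytes_per_frame = (bits // 8) * channels
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (bytes_per_frame * rate * seconds))
```

Running `matches_spec` over every file in a delivery is a cheap sanity check before you start training.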
Wrapping up
To conclude, here are our top picks for the best French Language datasets:
- Biggest Non-Commercial French Language Dataset – SIWIS French Speech Synthesis Database
- Best Child Adult Interaction French Language Dataset – Treatment of Oral Corpus in French
- Best Canadian French Language Dataset – The Canadian French Emotional Dataset
- Best French Native Reading Comprehension dataset – FQuAD
- Best Scripted French Language Dataset – The French Scripted Speech Corpus
We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.