Twine AI enables businesses to build ethical, custom datasets that reduce model bias and cover areas where humans are subjects, such as voice and vision. To help make model-building easier, we have put together a list of over 150 Open Audio and Video Datasets.
No matter the requirement—from dataset language to file type to participant gender—there is a dataset perfect for your machine-learning model.
Simply browse and sign up to gain access.
Need a custom dataset specific to your project? Contact us here.
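Because most entries in the list use the same "Field: value" layout, filtering by language or file type can be scripted. The following is a minimal sketch, not an official tool: the `parse_entry` helper is hypothetical, the field labels are assumed to match those used in most entries, and the two sample descriptions are abbreviated for illustration.

```python
import re

# Field labels as they appear in most entries of this list (an assumption:
# a few entries use variants such as "Language:" that this sketch ignores).
FIELDS = ["No. Recordings", "No. Participants", "File Size",
          "Filetype", "Language(s)", "Description"]
_LABEL = re.compile("(" + "|".join(re.escape(f) for f in FIELDS) + "):")


def parse_entry(line):
    """Turn one 'Field: value Field: value ...' line into a dict."""
    # Drop the trailing link text and table separator, if present.
    line = line.split("Click here to access")[0]
    # re.split with a capturing group keeps the labels:
    # ['', label1, value1, label2, value2, ...]
    parts = _LABEL.split(line)
    return {label: value.strip()
            for label, value in zip(parts[1::2], parts[2::2])}


# Two entries copied from the list, with shortened descriptions.
entries = {
    "Free Spoken digit dataset": parse_entry(
        "No. Recordings: 3000 No. Participants: 6 File Size: 10Mb "
        "Filetype: WAV Language(s): US English Description: Recordings "
        "of spoken English digits Click here to access |"),
    "Arabic Speech Corpus": parse_entry(
        "No. Recordings: 5439 Filetype: WAV Language(s): Arabic "
        "Description: Phonetic and orthographic transcriptions "
        "Click here to access |"),
}

# Keep only the datasets whose language field matches the requirement.
arabic = [name for name, e in entries.items() if e["Language(s)"] == "Arabic"]
print(arabic)
```

The same dictionary lookup works for any other field, for example `e["Filetype"] == "WAV"`. Entries that omit a field simply lack that key, so `e.get(...)` is the safer choice when scanning the whole list.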
150+ Audio and Video Open Datasets
Open Datasets – Audio
Urban Sound 8K dataset |
No. Recordings: 8732 File Size: 13.84KB Filetype: .WAV/.CSV Language(s): US English Description: Contains urban sounds from 10 classes, such as air conditioner, dog bark, drilling, siren, and street music. Click here to access |
Mozilla Common Voice |
No. Recordings: 75,879 File Size: 63Gb Filetype: MP3 Language(s): US English Description: An open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications. Click here to access |
HiEve |
No. Recordings: 1,000,000 Filetype: MP4 Language(s): US English Description: The largest collection of poses that focuses on very challenging and realistic tasks of human-centric analysis in various crowds & complex events, including subway getting on/off, collision, fighting, and earthquake escape Click here to access |
Voices Obscured in Complex Environmental Settings (VOICES) Dataset |
No. Recordings: 3,903 File Size: 1.3Gb Filetype: MP3 Language(s): US English Description: A creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification. Click here to access |
Free Spoken digit dataset |
No. Recordings: 3000 No. Participants: 6 File Size: 10Mb Filetype: WAV Language(s): US English Description: A simple speech dataset consisting of recordings of spoken English digits. Click here to access |
The Stereo Human Pose Estimation Dataset |
No. Recordings: 630 No. Participants: 26 File Size: 197.8Mb Filetype: JPEG Language(s): US English Description: A dataset of stereo image pairs suited for stereo human pose estimation of the upper body. Click here to access |
The Spoken Wikipedia Corpora |
No. Recordings: 5,397 No. Participants: 879 File Size: 23Gb Filetype: MP3 Language(s): English, German, Dutch Description: A corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedias. Click here to access |
TED-LIUM |
No. Recordings: 1,495 Language(s): US English Description: 1,495 audio recordings of TED talks, along with full-text transcriptions of those recordings. Click here to access |
Speech Commands Dataset |
No. Recordings: 65,000 Language(s): US English Description: 65,000 one-second-long utterances of 30 short words, by thousands of different people Click here to access |
Persian Consonant Vowel Combination (PCVC) Speech Dataset |
No. Recordings: 30,000 No. Participants: 217 Filetype: MAT Language(s): Persian Description: This dataset contains 23 Persian consonants and 6 vowels. The sound samples cover all possible combinations of vowels and consonants (138 samples for each speaker), each with a length of 30,000 data samples. Click here to access |
Arabic Speech Corpus |
No. Recordings: 5439 Filetype: WAV Language(s): Arabic Description: Phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recorded speech at the phoneme level. Click here to access |
TIMIT |
No. Recordings: 6,300 No. Participants: 630 Filetype: WAV Language(s): US English Description: Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences Click here to access |
Mivia Audio Events Dataset |
No. Recordings: 6,000 Filetype: WAV Language(s): US English Description: 6,000 events of surveillance applications, namely glass breaking, gunshots, and screams Click here to access |
Urban Sound Dataset |
No. Recordings: 1,302 Filetype: WAV Language(s): US English Description: 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. Click here to access |
Clotho Dataset |
No. Recordings: 4,981 Filetype: MP3 Language(s): US English Description: A novel audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions Click here to access |
FSD50K |
No. Recordings: 51,197 Filetype: WAV Language(s): US English Description: An open dataset of human-labeled sound events containing Freesound clips unequally distributed in 200 classes. Click here to access |
Vocal Imitation Set v1.1.3 |
File Size: 7.6Gb Filetype: WAV Language(s): US English Description: A collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound Click here to access |
Google Audio set |
No. Recordings: 2,084,320 Filetype: WAV Language(s): US English Description: 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. Click here to access |
CALLHOME American English Speech |
No. Recordings: 120 No. Participants: 240 Language(s): US English Description: 120 unscripted 30-minute telephone conversations between native speakers of English Click here to access |
LibriSpeech ASR Corpus |
No. Recordings: 1,000 Filetype: MP3 Language(s): US English Description: 1,000 hours of 16kHz read English speech Click here to access |
Speech Accent Archive |
No. Recordings: 2,140 File Size: 907Mb Filetype: MP3 Language(s): US English Description: Parallel English speech samples from 177 countries Click here to access |
Phone Conversation Data Sample |
No. Recordings: 1,822 Filetype: WAV Language(s): Dutch, Japanese, Irish English Description: Conversations in Dutch, Japanese, and Irish English. Click here to access |
Alexa Wake Word Voice Samples |
No. Recordings: 24 Filetype: WAV Language(s): US English Description: Sample of 24 Alexa wake word recordings in four languages Click here to access |
The LJ Speech Dataset |
No. Recordings: 13,100 File Size: 2.6Gb Filetype: CSV Language(s): US English Description: Public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. Click here to access |
AISHELL-2 |
No. Recordings: 1,000,000 No. Participants: 1,991 Language(s): Mandarin Description: The largest free speech corpus available for Mandarin ASR research Click here to access |
AEDD |
No. Recordings: 500 No. Participants: 5 Language(s): US English Description: 500 utterances by a diverse group of actors (over 5 actors) simulating various emotions Click here to access |
ANAD |
No. Recordings: 1,384 No. Participants: 8 File Size: 2Gb Filetype: WAV Language(s): US English Description: 1384 recordings by multiple speakers; 3 emotions: angry, happy, surprised Click here to access |
AudioMNIST |
No. Recordings: 30,000 No. Participants: 60 Filetype: MP3 Language(s): US English Description: Consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers Click here to access |
BAVED |
No. Recordings: 1,935 No. Participants: 61 File Size: 97.8Mb Filetype: WAV Language(s): US English Description: 1,935 recordings by 61 speakers (45 male and 16 female). Click here to access |
CMU-MOSEI |
No. Participants: 1,000 Language(s): US English Description: 65 hours of annotated video from more than 1000 speakers and 250 topics; 6 Emotions (happiness, sadness, anger, fear, disgust, surprise) + Likert scale. Click here to access |
CMU-MOSI |
No. Recordings: 2,199 Language(s): US English Description: 2,199 opinion utterances with sentiment annotated on a seven-step Likert scale from very negative to very positive. Click here to access |
CMU Wilderness |
No. Participants: 699 Filetype: Mp3 Language(s): US English Description: Speech dataset with voice actors of many accents reciting passages from the Bible Click here to access |
CREMA-D |
No. Recordings: 7,442 No. Participants: 91 File Size: 163Mb Filetype: GIT-LFS Language(s): US English Description: 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities Click here to access |
DAPS Dataset |
No. Recordings: 100 No. Participants: 20 Language(s): US English Description: 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books. Click here to access |
Deep Clustering Dataset |
File Size: 12Mb Filetype: WAV / Mp3 / OGG Language(s): US English Description: Training deep discriminative embeddings to solve the cocktail party problem Click here to access |
DEMoS |
No. Recordings: 9697 No. Participants: 68 Language(s): US English Description: 9365 emotional and 332 neutral samples were produced by 68 native speakers Click here to access |
EEKK |
No. Recordings: 1234 No. Participants: 10 Filetype: MP3 Language(s): US English Description: 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger, and neutral Click here to access |
Emo-DB |
No. Recordings: 500 No. Participants: 10 Language(s): US English Description: 800 recordings spoken by 10 actors (5 males and 5 females); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust Click here to access |
EmoFilm |
No. Recordings: 1115 Filetype: WAV Language(s): US English Description: 1,115 audio clips of sentences extracted from various films. Click here to access |
Emotional Voice dataset – Nature |
No. Recordings: 2519 No. Participants: 100 Language(s): US English Description: 2,519 speech samples were produced by 100 actors from 5 cultures Click here to access |
Emov-DB |
No. Participants: 4 File Size: 1.58GB Language(s): US English Description: Recordings of 4 speakers (2 male and 2 female); the emotional styles are neutral, sleepy, angry, disgusted, and amused. Click here to access |
EMOVO |
No. Recordings: 84 No. Participants: 6 Language(s): US English Description: 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness Click here to access |
eNTERFACE05 |
No. Participants: 42 File Size: 801MB Language(s): US English Description: Videos by 42 subjects, coming from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust Click here to access |
GEMEP corpus |
No. Recordings: 145 No. Participants: 10 Filetype: MP3 Language(s): US English Description: 10 actors portraying 10 different emotional states Click here to access |
IEMOCAP |
No. Participants: 10 Filetype: WAV Language(s): US English Description: 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration, and neutral Click here to access |
Keio-ESD |
Filetype: WAV Language(s): Japanese Description: A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joy, disgusting, downgrading, funny, worried, gentle, relief, indignation, shame, etc. Click here to access |
MSP-IMPROV |
No. Recordings: 8,438 No. Participants: 12 Language(s): US English Description: 20 sentences by 12 actors; 4 emotions (angry, sad, happy, neutral), plus the labels "other" and "without agreement". Click here to access |
MSP Podcast Corpus |
No. Recordings: 62140 No. Participants: 3260 Language(s): US English Description: 100 hours by over 100 speakers – annotated with emotional labels using attribute-based descriptors Click here to access |
NISQA Speech Quality Corpus |
No. Recordings: 14,000 No. Participants: 3,260 Language(s): US English Description: Includes 14k speech samples with simulated (codecs, packet-loss, background noise) and live (mobile phone, Zoom, Skype, WhatsApp) voice call degradation conditions Click here to access |
OGVC |
No. Recordings: 9114 No. Participants: 4 Language(s): US English Description: 9114 spontaneous utterances and 2656 acted utterances by 4 professional actors Click here to access |
RECOLA |
No. Participants: 46 Language(s): US English Description: 3.8 hours of recordings by 46 participants; negative and positive sentiment (valence and arousal) Click here to access |
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) |
No. Recordings: 7,356 No. Participants: 24 File Size: 24.8Gb Filetype: WAV Language(s): US English Description: 7,356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent. Click here to access |
SAVEE Dataset |
No. Recordings: 480 No. Participants: 4 Filetype: MP4 Language(s): British English Description: 4 male actors in 7 different emotions, 480 British English utterances in total. Click here to access |
SEMAINE |
No. Recordings: 95 No. Participants: 21 Language(s): US English Description: 95 dyadic conversations from 21 subjects. Each subject converses with another playing one of four characters with emotions Click here to access |
ShEMO3000 |
No. Recordings: 3,000 No. Participants: 87 Filetype: WAV Language(s): Persian Description: Semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native Persian speakers. Click here to access |
Spoken Commands dataset |
No. Recordings: 1,500 File Size: 10MB per word Language(s): US English Description: A testbed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations. Click here to access |
Tess |
No. Recordings: 2,800 No. Participants: 2 Filetype: WAV Language(s): US English Description: 2,800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutrality. Click here to access |
Thorsten dataset |
No. Recordings: 22668 Filetype: WAV Language(s): German Description: German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average. Click here to access |
URDU-Dataset |
No. Recordings: 400 No. Participants: 38 Filetype: WAV Language(s): Urdu Description: 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad. Click here to access |
VCTK dataset |
No. Recordings: 44,000 No. Participants: 110 File Size: 10.94GB Filetype: TXT Language(s): US English Description: 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB. Click here to access |
VIVAE |
No. Recordings: 1,085 No. Participants: 12 File Size: 93.5MB Filetype: VIVAE Language(s): US English Description: 1,085 non-speech audio files by ~12 speakers; 6 emotions (achievement, anger, fear, pain, pleasure, and surprise) at 4 emotional intensities (low, moderate, strong, peak). Click here to access |
VoxPopuli |
No. Recordings: 400,000 File Size: 6.4TB Filetype: WAV Language(s): Multilingual (23 languages) Description: 100K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, and 17.3K hours of speech-to-speech interpretation data for 16×15 directions. Click here to access |
Open Datasets – Video
Twenty Billion Neurons Crowd Acting video dataset collection |
No. Recordings: 220847 File Size: 19.4GB Filetype: WEBM Language(s): US English Description: Large-scale Human-centric Video Analysis in Complex Events Click here to access |
The VIRAT Video Dataset |
No. Recordings: 262 File Size: 12MB Filetype: PDF Language(s): US English Description: The VIRAT Video Dataset is designed to be more realistic, natural, and challenging for video surveillance domains than existing action recognition datasets in terms of its resolution, background clutter, diversity in scenes, and human activity/event categories. Click here to access |
The WebVid-10M Dataset |
No. Recordings: 10700000 File Size: 2.5MB Filetype: MP4 Language(s): US English Description: A large-scale dataset of short videos with textual descriptions sourced from the web Click here to access |
The MECCANO Dataset |
No. Recordings: 73206 No. Participants: 93 File Size: 32.3GB Filetype: MP4 Language(s): US English Description: The first dataset of egocentric videos to study human-object interactions in industrial-like settings. Click here to access |
Moments In Time |
No. Recordings: 1,000,000 File Size: 150MB Filetype: MP4 Language(s): US English Description: A large-scale dataset for recognizing and understanding action in videos Click here to access |
Something Something Dataset |
No. Recordings: 220847 File Size: 19.4GB Filetype: WEBM Language(s): US English Description: A large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects Click here to access |
BDD100K |
No. Recordings: 100000 File Size: 3.9GB Filetype: MP4 Language(s): US English Description: Comprises ten tasks and 100K videos to estimate the progress of image recognition algorithms on autonomous driving Click here to access |
Kinetics-700 |
No. Recordings: 650,000 File Size: 24.3MB Filetype: MP4 Language(s): US English Description: A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes. Click here to access |
Casual Conversations Dataset |
No. Recordings: 45,186 No. Participants: 3011 File Size: 15GB Filetype: MP4 Language(s): US English Description: About 45,000 videos from 3,011 participants, intended for assessing the performance of already trained models in computer vision and audio applications. Click here to access |
VoxCeleb |
No. Recordings: 1,000,000 No. Participants: 7,000 File Size: 133MB Filetype: MP4 Language(s): US English Description: An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube Click here to access |
TV Human Interaction Dataset |
No. Recordings: 300 File Size: 156MB Filetype: MP4 Language(s): US English Description: 300+ videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss. Click here to access |
THUMOS Dataset |
No. Recordings: 25,000,000 File Size: 385KB Filetype: MP4 Language(s): US English Description: A large collection of video clips of different kinds; the dataset can be used for action classification Click here to access |
50 Salads Dataset |
No. Participants: 25 File Size: 31GB Filetype: RGB Language(s): US English Description: Fully annotated 4.5-hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each. Click here to access |
YoutubeFace |
No. Recordings: 3425 No. Participants: 1595 Filetype: MP4 Language(s): US English Description: A database of face videos designed for studying the problem of unconstrained face recognition in videos. Click here to access |
PaSc |
No. Recordings: 9376 No. Participants: 293 Language(s): US English Description: A face recognition dataset of 9,376 still images and 2,802 videos of 293 people. Click here to access |
iQIYI-VID |
No. Recordings: 600000 No. Participants: 5000 Filetype: MP4 Language(s): US English Description: The largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. Click here to access |
COIN |
No. Recordings: 11827 File Size: 8.47MB Filetype: JSON Language(s): US English Description: 11,827 videos related to 180 different tasks, which were all collected from YouTube Click here to access |
CityScapes |
No. Recordings: 25000 File Size: 51.92GB Filetype: JPG Language(s): US English Description: A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities Click here to access |
AVA-Kinetics Dataset |
No. Recordings: 3650000 No. Participants: 39000 File Size: 7.7MB Filetype: CSV Language(s): US English Description: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Click here to access |
Activity Net |
No. Recordings: 20,194 File Size: 600GB Filetype: JSON Language(s): US English Description: A Large-Scale Video Benchmark for Human Activity Understanding Click here to access |
Kinetics |
No. Recordings: 650000 File Size: 24.3MB Filetype: MP4 Language(s): US English Description: A collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Click here to access |
Yahoo-Flickr Creative Commons 100 Million Dataset |
No. Recordings: 100000000 File Size: 15GB Filetype: MP4 Language(s): US English Description: The YFCC100M is the largest publicly and freely usable multimedia collection, containing around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses Click here to access |
UMDFaces |
No. Recordings: 4067888 No. Participants: 11377 File Size: 173MB Filetype: MP4 Language(s): US English Description: UMDFaces is a face dataset divided into two parts: Still Images – 367,888 face annotations for 8,277 subjects and Video Frames – Over 3.7 million annotated video frames from over 22,000 videos of 3100 subjects. Click here to access |
Condensed Movies |
No. Recordings: 462,000 File Size: 250GB Filetype: MP4 Language(s): US English Description: A large-scale video dataset, featuring clips from movies with detailed captions Click here to access |
AVSpeech |
No. Recordings: 290,000 File Size: 128MB Filetype: MP4 Language(s): US English Description: AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noises Click here to access |
EyeC3D |
No. Participants: 21 File Size: 3.9GB Language(s): US English Description: 3D video eye tracking dataset Click here to access |
MoVi |
No. Recordings: 1890 No. Participants: 90 File Size: 1.3MB Filetype: MP4 Language(s): US English Description: A large multi-purpose human motion and video dataset Click here to access |
Thör |
No. Recordings: 22668 File Size: WAV Language(s): US English Description: A public dataset of human motion trajectories, recorded in a controlled indoor experiment. Click here to access |
SEWA |
No. Participants: 398 Filetype: WAV Language(s): US English Description: More than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal. Click here to access |
Other Languages
The SIWIS French Speech Synthesis Database |
No. Recordings: 9,750 File Size: 2.671Gb Filetype: .WAV Language(s): French Description: High-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis. Click here to access |
TCOF : Traitement de Corpus Oraux en Français |
No. Recordings: 626 Filetype: .WAV Language(s): French Description: The corpus made available includes two main categories: recordings of adult-child interactions (children up to 7 years old) and recordings of interactions between adults. The recordings are of various durations: from 5 to 45 minutes or more. Click here to access |
African Accented French |
No. Participants: 84 File Size: 1.8Gb Filetype: .WAV Language(s): French Description: This corpus consists of approximately 22 hours of speech recordings from 84 speakers (48 male and 36 female). Click here to access |
Fisher Spanish Speech |
No. Participants: 136 No. Recordings: 819 Filetype: .WAV Language(s): Spanish Description: This corpus consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers. Click here to access |
CallFriend – Spanish Corpus |
No. Participants: 120 No. Recordings: 60 Filetype: .WAV Language(s): Spanish Description: The CallFriend Spanish corpus of telephone speech was collected by the Linguistic Data Consortium, primarily in support of the Language Identification (LID) project sponsored by the U.S. Department of Defense. It consists of 60 unscripted telephone conversations between native speakers of Spanish for each dialect group. Click here to access |
TV3Parla |
File Size: 27.6Gb Filetype: .WAV Language(s): Catalan Description: This corpus includes 240 hours of Catalan speech from broadcast material. Click here to access |
emotiontts_open_db |
Filetype: .WAV Language(s): Korean Description: Recordings and their associated transcriptions by a diverse group of speakers covering 4 emotions: general, joy, anger, and sadness. Click here to access |
Pansori TEDxKR |
No. Participants: 41 No. Recordings: 60 File Size: 174Mb Filetype: .FLAC Language(s): Korean Description: The Pansori TEDxKR Corpus is a Korean speech recognition (ASR) corpus generated from Korean language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech audio-transcript pairs from 41 speakers. This corpus was generated by using a new corpus data ingestion and processing system called Pansori. Click here to access |
EMOVO |
No. Participants: 6 No. Recordings: 84 File Size: 237Mb Filetype: .WAV Language(s): Italian Description: This dataset consists of 6 actors who recite 14 sentences in 6 different emotions: disgust, fear, anger, joy, surprise, sadness. Click here to access |
Online gaming voice chat corpus (OGVC) |
No. Participants: 4 No. Recordings: 2,656 Filetype: .WAV Language(s): Japanese Description: This speech material contains 2,656 acted utterances spoken by four professional actors (two male and two female). 17 short dialogues were selected from the dialogues recorded for the naturalistic emotional speech. The actors were instructed to speak each utterance in the short dialogues with a specific emotion at three different levels of emotional intensity. Click here to access |
Keio University Japanese Emotional Speech Database (Keio-ESD) |
No. Participants: 1 Filetype: .WAV Language(s): Japanese Description: A set of human speech with vocal emotion spoken by a Japanese male speaker, and a set of artificial speech synthesized by a system that was developed using a subset of this database for training. Click here to access |
NST Danish ASR Database |
No. Participants: 616 No. Recordings: 229,992 Filetype: .WAV Language(s): Danish Description: This database was created by Nordic Language Technology for the development of automatic speech recognition and dictation in Danish. Click here to access |
NST Danish Dictation |
No. Participants: 151 No. Recordings: 34,955 Filetype: .WAV Language(s): Danish Description: This database contains speech data for Danish, made for dictation. Click here to access |
NST Danish Speech Synthesis |
No. Participants: 1 No. Recordings: 4,108 Filetype: .WAV Language(s): Danish Description: This database contains speech data for Danish, made for speech synthesis. Click here to access |
FT Speech |
No. Participants: 434 No. Recordings: 1,017,244 Filetype: .WAV Language: Danish Description: FT Speech is a new speech corpus created from the recorded meetings of the Danish Parliament, also known as the Folketing (FT). It contains over 1,800 hours of transcribed speech by a total of 434 speakers, which are partitioned into five subsets with no speaker overlap between train, development, and test data. Click here to access |
FalaBrasil-LaPS Benchmark |
No. Participants: 35 No. Recordings: 700 Filetype: .WAV Language: Portuguese Description: LaPS is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. Contains 35 speakers (10 females), each one pronouncing 20 unique sentences, totaling 700 utterances in Brazilian Portuguese. The audios were recorded at 22.05 kHz without environment control. Click here to access |
M-AILABS Polish Corpus |
No. Participants: 35 No. Recordings: 700 File Size: 110Gb Filetype: .WAV Language: Polish Description: The M-AILABS Speech Dataset is the first large dataset that M-AILABS provides free of charge, freely usable as training data for speech recognition and speech synthesis. Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly 1,000 hours of audio and text files in a prepared format. The texts were published between 1884 and 1964 and are in the public domain. Click here to access |
Estonian |
No. Participants: 10 No. Recordings: 1,040 Filetype: .WAV Language: Estonian Description: 26 text passages read by 10 speakers, covering 4 main emotions: joy, sadness, anger, and neutral. Click here to access |
AESDD |
No. Participants: 10 No. Recordings: 500 File Size: 391Mb Filetype: .WAV Language: Greek Description: The Acted Emotional Speech Dynamic Database (AESDD) is a publicly available speech emotion recognition dataset. It contains utterances of acted emotional speech in the Greek language. The dataset consists of 500 utterances recorded by a diverse group of actors covering 5 different emotions: anger, disgust, fear, happiness, and sadness. Click here to access |
Microsoft Speech Corpus (Indian languages) |
No. Recordings: 124,599 Filetype: .WAV Languages: Telugu; Tamil; Gujarati Description: Microsoft Speech Corpus (Indian languages) release contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. The data package includes audio and corresponding transcripts. Click here to access |
Tunisian |
No. Participants: 118 No. Recordings: 11.2 hours File Size: 391Mb Filetype: .WAV Language: Arabic Description: Modern Standard Arabic (MSA) speech recorded in Tunisia; 118 speakers. Click here to access |
AISHELL-1 |
No. Participants: 400 File Size: 15Gb Filetype: .WAV Language: Mandarin Description: Aishell is an open-source Chinese Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd. 400 people from different accent areas in China were invited to participate in the recording, which was conducted in a quiet indoor environment using a high-fidelity microphone and downsampled to 16kHz. The manual transcription accuracy is above 95%, achieved through professional speech annotation and strict quality inspection. The data is free for academic use. Click here to access |
Malayalam Speech Corpus |
No. Participants: 75 File Size: 326Mb Filetype: .WAV Language: Malayalam Description: The Malayalam Speech Corpus (MSC) is one of the first open speech corpora for Automatic Speech Recognition (ASR) research and consists of 250 hours of Agricultural speech data involving 3 female, 12 male, and 60 unidentified participants. Click here to access |
Google Malayalam |
No. Participants: 24 File Size: 1.345Gb Filetype: .WAV Language: Malayalam Description: This data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers. Click here to access |
Multilingual LibriSpeech (MLS) |
File Size: 3Tb Filetype: .WAV Languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish Description: Multilingual LibriSpeech (MLS), released by Facebook AI, is a large-scale, open-source dataset designed to help advance research in automatic speech recognition (ASR). MLS is designed to help the speech research community work in languages beyond just English so people around the world can benefit from improvements in a wide range of AI-powered services. Click here to access |
The BABEL Project |
Filetype: .WAV Language: Bulgarian, Estonian, Hungarian, Polish, and Romanian Description: BABEL was a joint European project under the COPERNICUS scheme comprising partners from a number of Eastern and Western European research centers. BABEL has produced a multi-language database comprising five of the most widely differing Eastern European languages. Click here to access |
Living Audio Dataset |
Languages: Dutch, English, Irish, Russian Description: A “Crowd-Built” continuously growing speech dataset with transcripts. The dataset contains multiple languages and is intended for anyone to be able to add to it. Click here to access |
Microsoft Speech Language Translation Corpus |
No. Recordings: 61,270 File Size: 326Mb Filetype: .WAV Languages: English, Chinese, Japanese Description: The Microsoft Speech Language Translation Corpus release contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese collected by Microsoft Research. The package includes audio data, transcripts, and translations and allows end-to-end testing of spoken language translation systems on real-world data. Click here to access |
Conclusion
We hope that this list of 150+ open datasets has been helpful in assisting your model-building journey. If there are any open datasets you would like us to add to the list, then please let us know here.
Found a dataset you’d like to use? Click here to access
Need a custom dataset specific to your project? Contact us here