Check out our off-the-shelf datasets for speech and voice recognition. Need a dataset of your own? Contact us for a free quote on custom data.
VoxCeleb is a large-scale audio-visual speech dataset built from YouTube interview clips, widely used to train and benchmark deep speaker recognition models for speaker verification, speaker identification, and robust “in-the-wild” voice AI.
Casual Conversations is a large-scale multimodal (video + audio) benchmark dataset built to evaluate and audit computer vision and speech models for accuracy across diverse ages, genders, apparent skin tones, and lighting conditions.
These expressions are produced at two levels of emotional intensity (regular and strong), except for the neutral emotion, which is produced at regular intensity only.
Each video contains a number of procedure steps that together fulfill a recipe. All procedure segments are temporally localized in the video with a start time and an end time. The distributions of 1) video duration, 2) number of recipe steps per video, 3) recipe segment duration, and 4) number of words per sentence are shown below.
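As a rough illustration of how such temporally localized segments could be represented and summarized, here is a minimal Python sketch. The class names, field names, and the `summarize` helper are hypothetical and are not taken from the dataset's actual annotation format; the sketch only assumes that each segment carries a start time, an end time, and a sentence describing the step.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Segment:
    """One recipe step (hypothetical layout, not the dataset's real schema)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    sentence: str  # text describing the step

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Video:
    video_id: str
    duration: float          # total video length in seconds
    segments: List[Segment]  # annotated procedure steps


def summarize(videos: List[Video]) -> Dict[str, float]:
    """Return the mean of each of the four quantities named in the text."""
    segments = [s for v in videos for s in v.segments]
    return {
        "mean_video_duration": mean(v.duration for v in videos),
        "mean_steps_per_video": mean(len(v.segments) for v in videos),
        "mean_segment_duration": mean(s.duration for s in segments),
        "mean_words_per_sentence": mean(len(s.sentence.split()) for s in segments),
    }
```

Collecting the per-video and per-segment values into lists, rather than averaging them, would give the raw material for the four distributions.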
This corpus contains 15 spontaneous dialogues and multi-participant conversations by deaf signers: 10 were recorded in authentic settings such as a deaf club and a bar, and 5 were recorded in the lab.
Wake word "Alexa" in EU Spanish (es_ES) (e.g., "Eh, Alexa, cuéntame un chiste."). Each participant recorded an average of 70 utterances (minimum 50, maximum 75). This dataset contains the voice command only.