Speech Recognition

VoxCeleb

VoxCeleb is a large-scale audio-visual speech dataset built from YouTube interview clips, widely used to train and benchmark deep speaker recognition models for speaker verification, speaker identification, and robust “in-the-wild” voice AI.

Casual Conversations Dataset

Casual Conversations is a large-scale multimodal (video + audio) benchmark dataset built to evaluate and audit computer vision and speech models for accuracy across diverse ages, genders, apparent skin tones, and lighting conditions.

Audio-visual speech with multiple speakers

Large-scale audio-visual dataset comprising speech clips with no interfering background signals.

Activity Detection

Audio-visual emotion recognition

These expressions are produced at two levels of emotional intensity (regular and strong), except for the neutral emotion, which is recorded only at regular intensity.

Activity Detection

Instructional cooking videos

Each video contains a number of procedure steps that together complete a recipe. All procedure segments are temporally localized in the video with a start time and an end time. The distributions of 1) video duration, 2) number of recipe steps per video, 3) recipe segment duration, and 4) number of words per sentence are shown below.

Biometrics

Audio-visual recordings of sign language

This corpus contains 15 spontaneous dialogues and multi-participant conversations by deaf signers: 10 were recorded in authentic settings such as a deaf club and a bar, and 5 were recorded in the lab.

Biometrics

A dataset of videos of talking faces with transcriptions

Data were collected from 100 subjects, yielding over a thousand instances of synchronized data.

Biometrics

Lip Reading in the Wild (LRW)

The package including the videos and the metadata is available for non-commercial, academic research.

Mandarin (Shanghai) (China) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mandarin spoken in Shanghai, China

Romanian (Romania) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Romanian spoken in Romania

Polish (Poland) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Polish spoken in Poland

Panjabi (Pakistan) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Panjabi spoken in Pakistan

Mongolian (Mongolia) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mongolian spoken in Mongolia

Mandarin (Traditional) (Taiwan) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mandarin (Traditional) spoken in Taiwan

Mandarin (Simplified) (China) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mandarin (simplified) spoken in China

Lao (Laos) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Lao spoken in Laos

Kannada (India) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Kannada spoken in India

Greek (Greece) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Greek spoken in Greece

German (Switzerland) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, German spoken in Switzerland

French (Algeria) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, French spoken in Algeria

Farsi/Persian (Iran) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Farsi/Persian spoken in Iran

English (UAE) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in UAE

English (Philippines) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in Philippines

English (Hong Kong) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in Hong Kong

English (Australia) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in Australia

Dutch (Netherlands) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Dutch spoken in the Netherlands

Dutch (Belgium) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Dutch spoken in Belgium

Spanish (Mexico) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Spanish spoken in Mexico

Spanish (ESP) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Spanish spoken in Spain

Catalan (Spain) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Catalan spoken in Catalonia, Spain.

Sinhalese (Sri Lanka) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Sinhalese spoken in Sri Lanka

Vietnamese General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Vietnamese spoken in Vietnam

Tamil General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Tamil spoken in India

Singaporean-English General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Singaporean-English spoken in Singapore

Punjabi General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Punjabi spoken in India

Malay General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Malay spoken in Malaysia

Bahasa (Indonesia) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Bahasa spoken in Indonesia

Thai General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Thai spoken in Thailand

Gujarati General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Gujarati spoken in India.

UK Accents Dataset

Various UK Accents

Arabic (Saudi Arabia) language conversational telephony

Dataset is fully transcribed and timestamped.

Arabic (Dubai) language conversational telephony

Dataset is fully transcribed and timestamped.

UK Voice Commands Dataset

Voice Commands in the English Language

Portuguese (PT) language conversational telephony

Dataset is fully transcribed and timestamped.

German language conversational telephony

Dataset is fully transcribed and timestamped.

Bengali (Bangladesh) conversational telephony

Dataset is fully transcribed and timestamped.

Phone Conversations in Hindi

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Japanese

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Spanish

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in French

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Indian English

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Irish English

2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two Irish speakers

Alexa Wake Words in Canadian French (Adults)

This data set contains recordings of the wake word "Alexa" in Canadian French (fr_CA) (e.g., "Alexa, raconte-moi une blague.").

Alexa Voice Commands in EU Spanish (Adults)

Wake word "Alexa" in EU Spanish (es_ES) (e.g., "Eh, Alexa, cuéntame un chiste."). Each participant recorded 70 utterances on average (minimum 50, maximum 75). This data set contains the voice command only.

Alexa Wake Words in Mexican Spanish (Adults)

This data set contains recordings of the wake word "Alexa" in Mexican Spanish (es_MX) used in voice commands (e.g., "Oye Alexa, cuéntame un chiste.").

Wake Words

Siri Wake Words and Voice Commands in US English

US English voice commands including the wake word "Hey Siri" from 103 participants aged 19-68.

Wake Words and Voice Commands in Korean with Seoul Dialect

Korean voice commands including the wake word "Hi Bixby" from 52 participants in Seoul.