Audio-visual speech with multiple speakers

Large-scale audio-visual dataset comprising speech clips with no interfering background signals.
Files: 100
Size: 4 GB
Format: wav
Duration: 8 hours
Country: Worldwide
Participants: 40
Languages: English
Updated: December 8, 2025

Description

The segments vary in length from 3 to 10 seconds. In each clip, the only visible face in the video and the only audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 20 hours of video segments with approximately 40 distinct speakers, spanning a wide variety of people, languages and face poses.

Licence

Twine Commercial License

Version Info

Version: 1.2
Last updated: December 8, 2025
Owner: Twine AI

Dataset Technical Specification

Number of files: 100
Total dataset size: 4 GB
Duration: 8 hours
Format: wav
Sample rate: 48 kHz
Resolution: N/A
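
As a minimal sketch of how a downloaded file might be checked against the specification above, the following uses Python's standard `wave` module to read a file's sample rate and duration. The function names `wav_info` and `matches_spec` are hypothetical helpers written for this illustration, not part of any Twine AI tooling, and the sketch assumes the files are plain PCM WAV, which the listing does not state explicitly.

```python
import wave

# Sample rate taken from the specification above (48 kHz).
EXPECTED_SAMPLE_RATE_HZ = 48_000


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
    # Duration is the number of frames divided by frames per second.
    return rate, frames / rate


def matches_spec(path, expected_rate=EXPECTED_SAMPLE_RATE_HZ):
    """Return True if the file's sample rate equals the expected rate."""
    rate, _ = wav_info(path)
    return rate == expected_rate
```

Calling `matches_spec` on each file and summing the durations returned by `wav_info` would give a quick sanity check of the listed sample rate and total duration. The file layout and naming inside the dataset are not described here, so any directory traversal would need to be adapted to the actual delivery.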

Dataset Demographics

📍 Country: Worldwide
🧍 Gender: 60% male, 40% female
📅 Age: 18-55
👥 Number of participants: 40

🛡️ Consent & Compliance

✅ GDPR Compliant

Consent Summary:

All participants provided informed consent for data collection and usage in AI training applications.

Data Collection:

Data was collected through the Boom app phone conversation module with full participant awareness and agreement.

Ethical Considerations:

All data has been anonymized and privacy-preserving measures have been implemented to protect participant identities.