Portuguese is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Portuguese language speech datasets with a specific dialect or type of speech to train your models.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Portuguese language speech datasets.
Are you ready?
Let’s dive into our list of the best Portuguese language speech datasets in 2022.
Here are our top picks for Portuguese language speech datasets:
1. How2 Dataset
Created by Sanabria et al. in 2018, the How2 Dataset is a collection of instructional videos covering a wide variety of topics. It holds about 2,000 hours of video clips, with word-level time alignments to the ground-truth English subtitles, and about 300 hours of it were translated into Portuguese subtitles. The dataset is in Portuguese and English and contains ~2,000 hours.
2. Portuguese SQuAD V1.1 Dataset
Created by Carvalho et al. in 2019, the Portuguese SQuAD v1.1 Dataset is a Portuguese translation of the SQuAD dataset. The translation was performed using the Google Cloud API. The dataset is in Portuguese and contains ~100,000 question-answer pairs in JSON file format.
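If you want to inspect a SQuAD-style JSON file before training, the sketch below flattens it into one record per question. It assumes the Portuguese release keeps the nested layout of the original SQuAD v1.1 (data, paragraphs, qas, answers), which you should confirm against the file you download. The function name iter_squad_examples, the span_matches field, and the tiny sample are made up for illustration. The span check is included because character offsets may no longer line up with the context after machine translation.

```python
import json


def iter_squad_examples(squad):
    """Yield one flat record per question from a SQuAD v1.1-style dict.

    Assumes the original SQuAD v1.1 layout:
    data -> paragraphs -> qas -> answers (text, answer_start).
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                first = answers[0] if answers else None
                text = first["text"] if first else None
                start = first["answer_start"] if first else None
                # True only when the stored offset really points at the answer.
                span_ok = (
                    first is not None
                    and context[start:start + len(text)] == text
                )
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answer_text": text,
                    "answer_start": start,
                    "span_matches": span_ok,
                }


# Hypothetical one-question sample in the assumed layout (not real data).
SAMPLE = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Portugal",
      "paragraphs": [
        {
          "context": "Lisboa é a capital de Portugal.",
          "qas": [
            {
              "id": "exemplo-1",
              "question": "Qual é a capital de Portugal?",
              "answers": [{"text": "Lisboa", "answer_start": 0}]
            }
          ]
        }
      ]
    }
  ]
}
""")

records = list(iter_squad_examples(SAMPLE))
misaligned = [r["id"] for r in records if not r["span_matches"]]
print(len(records), "questions,", len(misaligned), "with misaligned answer spans")
```

To run this on a real file, replace SAMPLE with the result of json.load on the downloaded file opened with encoding="utf-8". Records where span_matches is False are worth reviewing or dropping before you use them for extractive question answering.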
3. OSCAR Corpus Dataset
Created by Pedro et al. in 2020, the BrWaC (Brazilian Portuguese Web as Corpus) or OSCAR Corpus is a large corpus constructed by its authors following the WaCky framework and made public for research purposes. The dataset is in Portuguese and comes in text file format.
4. DNLT-BP Dataset
The Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP) contain data collected from participants in clinical or academic studies and research. Participants read and signed the Informed Consent Form, and the research was evaluated and approved by the Research Ethics Committees of the institutions to which the studies are linked. The dataset is in Portuguese and comes in text file format.
5. PortugueseGLUE Dataset
PortugueseGLUE contains Portuguese translations of the GLUE benchmark and the SciTail dataset, produced with the OPUS-MT model and Google Cloud Translation. The dataset is in Portuguese and comes in text file format.
6. TweetSentBR Dataset
Created in 2020, TweetSentBR is a sentiment polarity classification dataset with 800k tweets in Portuguese, divided into positive, negative, and neutral classes. The dataset is in Portuguese and comes in text file format.
7. B5 Corpus Dataset
Created by Ramos et al. in 2018, the B5 Corpus Dataset is a collection of Facebook posts by Brazilian authors, with information about each author such as gender, age, personality score (based on the B5 test), education level, political position, religion, and others. The dataset is in Portuguese and comes in CSV file format, with a listed size of 1012.
Wrapping up
To conclude, here are our top picks for the best Portuguese language speech datasets for your projects:
- How2 Dataset
- Portuguese SQuAD V1.1 Dataset
- OSCAR Corpus Dataset
- DNLT-BP Dataset
- PortugueseGLUE Dataset
- TweetSentBR Dataset
- B5 Corpus Dataset
We hope that this list has either helped you find a dataset for your project or shown you the many options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.