The Best Turkish Language Datasets of 2022

Turkish is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Turkish language datasets to train your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Turkish Language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Turkish Language datasets:

1. TS Corpus Project

TS Corpus is a free and independent project that aims to build Turkish corpora, develop Natural Language Processing tools, and compile linguistic datasets. Users are free to run queries, save them, and download the hit sets to their computers. Together, the 14 published corpora provide over 1.3 billion tokens derived from various sources: online newspapers, forums, social media, academic papers, etc.

Access the dataset

2. Turkish National Corpus (TNC)

Turkish National Corpus is designed to be a balanced, large-scale (50 million words) and general-purpose corpus for contemporary Turkish. It consists of samples of textual data across a wide variety of genres covering a period of 24 years (1990-2013).

The written component consists of texts produced in different domains on various topics. Transcriptions from spoken data constitute 2% of TNC’s database, which involves spontaneous, everyday conversations and speeches collected in particular communicative settings.

Access the dataset

3. Bilkent Turkish Writings Dataset

This dataset contains content from Turkish creative writing courses held between 2014 and 2018. All in all, there are nearly 7,000 texts available for download in CSV format.

Access the dataset
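Since the Bilkent texts come as CSV, a first look at them can be as simple as the sketch below. It is a minimal example, not the dataset's documented schema: the sample rows and the column names (`id`, `year`, `text`) are assumptions made up for illustration, so check the header row of the real file before relying on them.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded file. The real column
# names may differ, so inspect the header row of the actual CSV first.
SAMPLE_CSV = (
    "id,year,text\n"
    "1,2014,Bugün hava çok güzel.\n"
    '2,2018,"Kitap okumayı, yazmayı severim."\n'
    "3,2016,\n"
)


def load_texts(csv_file, text_column="text"):
    """Return the non-empty values of `text_column` from an open CSV file."""
    reader = csv.DictReader(csv_file)
    texts = []
    for row in reader:
        # Missing or empty cells are skipped rather than raising an error.
        value = (row.get(text_column) or "").strip()
        if value:
            texts.append(value)
    return texts


texts = load_texts(io.StringIO(SAMPLE_CSV))
print(len(texts), "texts loaded")
```

For a real download, you would pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. The UTF-8 encoding is also an assumption, but it matters for Turkish characters such as ı, ğ and ş.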

4. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset

TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

Two noise reduction methodologies are applied when post-processing the raw collections: (a) domain-dependent and (b) domain-independent. The Turkish collections contain approximately 700K sentences each (the exact number varies between versions), while the English collections contain more than 7M sentences.

Access the dataset

5. Middle East Technical University Turkish Microphone Speech v 1.0

Middle East Technical University Turkish Microphone Speech v 1.0 was developed at Middle East Technical University (METU) and contains text, speech, and alignment files for approximately 5.6 hours of recorded Turkish.

The corpus is about 600 MB in size. Each of the 120 speakers (60 male and 60 female) reads 40 sentences (approximately 300 words per speaker). The 40 sentences are selected randomly for each speaker from a triphone-balanced set of 2,462 Turkish sentences.

The speakers are selected from students, faculty, and staff at METU and all are native speakers of Turkish. The age range is from 19 to 50 years with an average of 23.9 years. The data has been digitally recorded with a Sound Blaster sound card on a PC at a 16 kHz sampling rate.

Access the dataset
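The figures quoted for the METU corpus can be cross-checked with a little arithmetic. The sketch below uses only the numbers given above, plus one assumption that the description does not state: that the audio is stored as uncompressed 16-bit mono PCM.

```python
# Figures quoted in the corpus description.
SPEAKERS = 120
SENTENCES_PER_SPEAKER = 40
WORDS_PER_SPEAKER = 300
HOURS = 5.6
SAMPLE_RATE_HZ = 16_000

# Assumption (not stated in the description): uncompressed 16-bit mono PCM.
BYTES_PER_SAMPLE = 2

utterances = SPEAKERS * SENTENCES_PER_SPEAKER
total_seconds = HOURS * 3600
seconds_per_utterance = total_seconds / utterances
words_per_sentence = WORDS_PER_SPEAKER / SENTENCES_PER_SPEAKER
audio_megabytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 1e6

print(f"utterances:            {utterances}")
print(f"seconds per utterance: {seconds_per_utterance:.1f}")
print(f"words per sentence:    {words_per_sentence:.1f}")
print(f"raw audio size (MB):   {audio_megabytes:.0f}")
```

This works out to 4,800 utterances of about 4.2 seconds each, with 7.5 words per sentence on average. Under the 16-bit assumption, the raw audio comes to roughly 645 MB, which is in the same range as the quoted ~600 MB. The actual files may use a different sample format or compression.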

6. Turkish Broadcast News Speech and Transcripts

Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts.

The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions.

Access the dataset

Wrapping up

To conclude, these are our top picks for the best Turkish language datasets for your projects.

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.

If there are any datasets you would like us to add to the list, please let us know here.

If you would like to learn more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:
