MARA Corpus - A Large Expressive Romanian Speech Corpus

The MARA Speech Corpus

This corpus contains the audiobook Mara written by Ioan Slavici and published in 1906. The corpus consists of approximately 11 hours of speech recorded by a female speaker in a controlled environment.

The audiobook was kindly provided by Mihai Nae from Cartea Sonora.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. When using it in your work, please acknowledge Cartea Sonora.

An Expressive Romanian Speech Dataset

The MARA Corpus is a result of the SINTERO Project, funded by the Romanian Ministry of Research and Innovation, PCCDI – UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0818/73, within PNCDI III.

The corpus contains over 11 hours of high-quality data recorded by a professional female speaker. The data is orthographically transcribed, manually segmented at the utterance level and semi-automatically aligned at the phone level. The associated text is processed by a complete linguistic feature extractor comprising text normalisation, phonetic transcription, syllabification, lexical stress assignment, lemma extraction, part-of-speech tagging, chunking and dependency parsing.

The data was manually segmented into 8185 utterances, and inconsistencies between the text and audio were corrected. The average utterance length is around 5 seconds, corresponding to approximately 12 words.
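As a rough consistency check, the utterance count and average length quoted above can be multiplied out to estimate the total duration and word count. The sketch below is only back-of-the-envelope arithmetic on those rounded figures; the function and constant names are illustrative and are not part of the corpus distribution.

```python
# Figures quoted in the corpus description. The averages are approximate,
# so the results below are estimates, not exact corpus statistics.
NUM_UTTERANCES = 8185
AVG_UTTERANCE_SECONDS = 5.0
AVG_WORDS_PER_UTTERANCE = 12


def estimate_corpus_size(num_utterances, avg_seconds, avg_words):
    """Return (total_hours, total_words) estimated from per-utterance averages."""
    total_hours = num_utterances * avg_seconds / 3600  # seconds -> hours
    total_words = num_utterances * avg_words
    return total_hours, total_words


hours, words = estimate_corpus_size(
    NUM_UTTERANCES, AVG_UTTERANCE_SECONDS, AVG_WORDS_PER_UTTERANCE
)
print(f"Estimated duration: {hours:.1f} hours")
print(f"Estimated size: about {words} words")
```

With these rounded inputs the estimate comes to roughly 11.4 hours and on the order of 98,000 words, which agrees with the stated total of over 11 hours of speech.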

You can listen to natural audio samples, as well as samples of synthetic voices built from the MARA corpus HERE.

Download

Please fill in and sign the LICENSE AGREEMENT form and return it to adriana.stan@com.utcluj.ro to access the corpus.

Developers

Team

Adriana STAN

Beáta LŐRINCZ

Maria NUȚU

Mircea GIURGIU