The SWARA Speech Corpus

A Large Parallel Romanian Read Speech Dataset


The SWARA Corpus is a result of the SWARA Project, funded by the Romanian Ministry of Education under grant agreement PN-II-PT-PCCA-2013-4 No 6/2014. The corpus contains over 21 hours of high-quality recordings from 17 different speakers. The data is segmented into 19,279 utterances and includes their orthographic transcripts and semi-automatic phone-level alignments.


If you use the SWARA Corpus, please cite the following paper:


Adriana Stan, Florina Dinescu, Cristina Țiple, Șerban Meza, Bogdan Orza, Magdalena Chirilă, and Mircea Giurgiu, "The SWARA Speech Corpus: A Large Parallel Romanian Read Speech Dataset," in Proceedings of the 9th Conference on Speech Technology and Human-Computer Dialogue, Bucharest, Romania, July 6-9, 2017. pdf | bib


You can listen to audio samples of each speaker, as well as samples of synthetic voices built from the SWARA Corpus, HERE.


The list of utterances that were read identically by all speakers can be found HERE.



Download

For research use of the corpus, please fill in the License Agreement and return it to the maintainer of the corpus.



Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
THE CONTRIBUTORS TO THIS WORK DISCLAIM ALL WARRANTIES WITH REGARD TO THIS DATA, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE CONTRIBUTORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS DATA.

Developers

Team