Romanian language statistics and resources for text-to-speech systems
by Adriana Stan, Mircea Giurgiu
Abstract:
This paper introduces a series of results and experiments used in the development of a Romanian text-to-speech system, focusing on text statistics. We investigate the presence of several linguistic units used in text-to-speech systems, from phonemes to words. The text corpus we used, News-Romanian (News-RO), comprises 4500 newspaper articles. A subset of it, around 2500 sentences, represents the Romanian Speech Synthesis (RSS) recorded speech database. The results offer important insight into how a speech database should be designed. We also describe the methods used in the development of a 50,000-word Romanian lexicon with phonetic transcription and accent positioning. Such a lexicon is useful for the machine learning algorithms in the front-end of a text-to-speech system. In addition, we study the use of the Maximal Onset Principle for Romanian syllabification.
Reference:
Adriana Stan, Mircea Giurgiu, "Romanian language statistics and resources for text-to-speech systems", In Proceedings of the 9th Edition of the International Symposium on Electronics and Telecommunications, Timisoara, Romania, 2010.
Bibtex Entry:
@inproceedings{ISETC10,
  author = {Adriana Stan and Mircea Giurgiu},
  title = {Romanian language statistics and resources for text-to-speech 
                    systems},
  booktitle = {Proceedings of the $9^{th}$ Edition of the International 
                    Symposium on Electronics and Telecommunications},
  abstract = {This paper introduces a series of results and experiments 
                   used in the development of a Romanian text-to-speech 
                   system, focusing on text statistics. We investigate the 
                   presence of several linguistic units used in text-to-speech 
                   systems, from phonemes to words. The text corpus we used, 
                   News-Romanian (News-RO), comprises 4500 newspaper articles. 
                   A subset of it, around 2500 sentences, represents the Romanian 
                   Speech Synthesis (RSS) recorded speech database. The results 
                   offer important insight into how a speech database should be 
                   designed. We also describe the methods used in the development 
                   of a 50,000-word Romanian lexicon with phonetic transcription 
                   and accent positioning. Such a lexicon is useful for the machine 
                   learning algorithms in the front-end of a text-to-speech 
                   system. In addition, we study the use of the Maximal Onset 
                   Principle for Romanian syllabification.},
  year = {2010},
  address = {Timisoara, Romania}
}