Current projects


Technologies for Building Human-Machine Interfaces for Expressive Text-to-Speech Synthesis (2018-2020)


In recent years, text-to-speech synthesis systems have reached a very high level of naturalness in the synthesised voice, and their use in commercial applications that automate human-machine interaction has become increasingly widespread and highly profitable. Nevertheless, these systems have several limitations. A first limitation is the small number of synthetic voices available for a given language, which prevents customising the synthesis system to clients' specific requirements (e.g. synthesising an audiobook in one's own voice). Typically, creating a new high-quality synthetic voice requires a speaker to spend a long time in a recording studio, amounting to tens or even hundreds of hours of collected data. Recent speaker adaptation methods for parametric synthesis systems can reduce this time to the order of tens of minutes, but the results are not always satisfactory. A second limitation concerns the expressiveness of these systems. While the lack of expressiveness is not a problem for voice information systems delivering short messages, rendering longer texts, or a verbal style other than the informative one, poses advanced scientific and technical challenges, because expressiveness is very hard to formalise in an abstract, compact language that can easily be transposed into a set of programmatic instructions.
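To illustrate what such a "set of programmatic instructions" looks like in current practice, here is a minimal fragment of SSML, the W3C markup language accepted by many commercial TTS engines. The text and the specific prosody values are invented for this example; the point is how coarse this kind of control is compared to genuine expressivity:

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- slow the rate and lower the pitch for a calm, narrative delivery -->
  <prosody rate="slow" pitch="-2st">
    Once upon a time, in a quiet village,
  </prosody>
  <!-- emphasise one word; how the emphasis is realised is engine-dependent -->
  there lived an <emphasis level="strong">extraordinary</emphasis> storyteller.
</speak>
```

Even with such markup available, deciding which tags to apply, and where, for a long text in a non-informative style remains an open research problem.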

Past projects


Mobile System for Rehabilitative Vocal Assistance of Surgical Aphonia (2014-2016)


With the advances in voice conservative surgeries and radiotherapy techniques, most patients with cancer of the larynx can be cured. However, for those who do not respond or present with recurrent or advanced disease, total laryngectomy is the only curative approach that can be offered. The prognosis of laryngectomized patients has remained relatively favourable over the years, with five-year survival rates of 65-75%.
Voice is an important component of human identity. When people lose their voices, devices exist that enable them to speak again, but these devices offer a limited number of "identities", and this can have a negative psychological impact on the person. Moreover, when faced with mutilating surgery, especially in cases where there is a trade-off between the ability to speak and the best chance of curing the disease, people tend to lean towards preserving their own personal voice.
Immediately after a laryngectomy, patients are unable to speak. Most patients find this extremely distressing. The consequences of such a massive communicational disorder are, in general, fear, depression, hopelessness and passivity. From a psychological point of view, one can expect that quickly regaining vocal communication generally facilitates the social and psychological rehabilitation of laryngectomized patients. However, one should pay attention to the fact that many laryngectomized patients, while learning to speak with a voice prosthesis, suffer great emotional distress. At this point, the patients realize that normal speech articulation cannot be attained and that their new voice attracts social attention. Many of the social responses that the patients experience are ambivalent or negative. Thus, laryngectomized patients often experience communication failures and open or covert rejection (e.g. early termination of the conversation, interruption by others). As a consequence of the subjective experience of the noticeable difference in their new voice, patients tend to depreciate themselves in terms of stigmatization. Frequently this results in social withdrawal and isolation. Restoring a patient's ability to communicate in all daily activities is therefore an essential goal in the patient's complete physical and mental health restoration.



Speech Synthesis that Improves Through Adaptive Learning (2011-2014)


In order to be accepted by users, the voice of a spoken interaction system must be natural and appropriate for the content. Using the same voice for every application is not acceptable to users. But creating a speech synthesiser for a new language or domain is too expensive, because current technology relies on labelled data and human expertise. Systems comprise rules, statistical models, and data, requiring careful tuning by experienced engineers.

As a result, speech synthesis is available from only a small number of vendors, offering generic products not tailored to any application domain. Systems are not portable: creating a bespoke system for a specific application is hard, because it involves substantial effort to re-engineer every component of the system. Take-up by potential end users is limited, and the range of feasible applications is narrow. Synthesis is often an off-the-shelf component whose speaking style is highly inappropriate for applications such as dialogue, speech translation, games, personal assistants, communication aids, SMS-to-speech conversion, e-learning, toys and a multitude of other applications where a specific speaking style is important.

We are developing methods that enable the construction of systems from audio and text data. We are enabling systems to learn after deployment. General purpose or specialised systems for any domain or language will become feasible. Our objectives are:

  • ADAPTABILITY: create highly portable and adaptable speech synthesis technology suitable for any domain or language
  • LEARNING FROM DATA AND INTERACTION: provide a complete, consistent framework in which every component of a speech synthesis system can be learned and improved
  • SPEAKING STYLE: enable the generation of natural, conversational, highly expressive synthetic speech which is appropriate to the wider context
  • DEMONSTRATION AND EVALUATION: demonstrate the automatic creation of a new speech synthesiser from scratch and feedback-driven online learning, validated by perceptual evaluations
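The feedback-driven online learning objective can be illustrated with a deliberately simplified sketch: a synthesiser keeps a quality score for each candidate speech unit and nudges that score towards listener ratings received after deployment. The class name, the per-unit scoring scheme, and the moving-average update rule are illustrative assumptions, not the project's actual method:

```python
class OnlineUnitSelector:
    """Toy sketch of learning from interaction after deployment."""

    def __init__(self, units, learning_rate=0.2):
        # start every unit at a neutral prior quality of 0.5
        self.scores = {u: 0.5 for u in units}
        self.lr = learning_rate

    def choose(self):
        # pick the unit currently believed to sound best
        return max(self.scores, key=self.scores.get)

    def rate(self, unit, rating):
        # rating in [0, 1] from a listener; exponential moving average
        # pulls the stored score towards the new evidence
        self.scores[unit] += self.lr * (rating - self.scores[unit])


selector = OnlineUnitSelector(["unit_a", "unit_b"])
selector.rate("unit_b", 1.0)   # listener liked unit_b
selector.rate("unit_a", 0.0)   # listener disliked unit_a
print(selector.choose())       # prints "unit_b"
```

A real system would of course score context-dependent model parameters rather than whole units, and would have to cope with noisy, sparse feedback; the sketch only shows the shape of the feedback loop.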


Sound2Sense (2007-2011)


S2S is an interdisciplinary EC-funded Marie Curie Research Training Network (MC-RTN) involving engineers, computer scientists, psychologists, and linguistic phoneticians.
We use a variety of approaches to investigate what types of information are available in the speech signal, and how listeners use that information when they are listening in their native language, or in a foreign language, or in a noisy place like a railway station, when it is hard to hear the speech. These three types of listening situation allow us to see how listeners actively use their knowledge, together with the speech they hear, to understand a message.
Recent research shows that quite fine phonetic detail in the speech signal can carry information crucial to successfully understanding every aspect of a message, from its formal linguistic content, like words and grammar, to the interactional structure which keeps a conversation going. This is not the traditional view, and it challenges most models of speech processing, especially in the central role they give to phonemes and syllables. In contrast, two of S2S’s fundamental principles are that phonetic information is encoded in units of different lengths and degrees of complexity, and that any given sound in the signal fulfils multiple communicative functions simultaneously—its fine detail indicating what those functions are.