Adding expressiveness to unit selection speech synthesis and to numerical voice production
Mr. Marc Freixes' Thesis Lecture
Speech is one of the most natural and direct forms of communication between human beings, as it encodes both a message and paralinguistic cues about the speaker's emotional state, mood, or intention. It is therefore instrumental in pursuing a more natural Human-Computer Interaction (HCI). In this context, the generation of expressive speech for the HCI output channel is a key element in the development of assistive technologies and personal assistants, among other applications.
Synthetic speech can be generated from recorded speech using corpus-based methods such as Unit Selection (US), which can achieve high-quality results but whose expressiveness is restricted to that available in the speech corpus. In order to improve the quality of the synthesis output, the current trend is to build ever larger speech databases, especially following the so-called End-to-End synthesis approach based on deep learning techniques. However, recording ad hoc corpora for each and every desired expressive style can be extremely costly, or even unfeasible if the speaker is unable to properly perform the styles required for a given application (e.g., singing in the storytelling domain). Alternatively, new methods based on the physics of voice production have been developed in the last decade thanks to the increase in computing power. For instance, vowels or diphthongs can be obtained using the Finite Element Method (FEM) to simulate the propagation of acoustic waves through a realistic 3D vocal tract geometry obtained from Magnetic Resonance Imaging (MRI). Since the main efforts in these numerical voice production methods have focused on improving the modelling of the voice generation process, little attention has been paid to their expressiveness up to now. Furthermore, collecting data for such simulations is very costly and requires time-consuming manual postprocessing, such as that needed to extract 3D vocal tract geometries from MRI.
The aim of the thesis is to add expressiveness to a system that generates neutral voice, without having to acquire expressive data from the original speaker. On the one hand, expressive capabilities are added to a Unit-Selection Text-to-Speech (US-TTS) system fed with a neutral speech corpus, to address specific and timely needs in the storytelling domain, such as singing or suspenseful situations. To this end, speech is parameterised using a harmonic-based model and subsequently transformed to the target expressive style according to an expert system. A first approach dealing with the synthesis of increasing suspense in storytelling shows the viability of the proposal in terms of naturalness and storytelling quality. Singing capabilities are also added to the US-TTS system through the integration of Speech-to-Singing (STS) transformation modules into the TTS pipeline, and by incorporating an expressive prosody generation module that allows the US to select units closer to the target singing prosody obtained from the input score. This results in a Unit-Selection-based Text-to-Speech-and-Singing (US-TTS&S) synthesis framework that can generate both speech and singing from the same small neutral speech corpus (~2.6 h). According to the objective results, the score-driven US strategy can reduce the pitch scaling factors required to produce singing from the selected spoken units, but its effectiveness is limited regarding the time-scale requirements due to the short duration of the spoken vowels. Results from the perceptual tests show that, although the obtained naturalness is far from that given by a professional singing synthesiser, the framework can address occasional singing needs for synthetic storytelling with reasonable quality.
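To make the notion of pitch and time scaling factors concrete, the following minimal Python sketch computes both for a single spoken unit mapped onto a single score note. It is only an illustration under stated assumptions: the function names (`midi_to_hz`, `scaling_factors`, `to_semitones`), the use of MIDI note numbers with A4 = 440 Hz, and the example values are hypothetical and are not taken from the thesis, whose actual unit selection costs and transformation modules are not described here.

```python
import math


def midi_to_hz(midi_note):
    """Convert a MIDI note number to Hz (assumes equal temperament, A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def scaling_factors(unit_f0_hz, unit_dur_s, midi_note, note_dur_s):
    """Return (pitch_factor, time_factor) needed to turn a spoken unit into a sung note.

    pitch_factor: ratio between the score note's F0 and the spoken unit's F0.
    time_factor:  ratio between the score note's duration and the unit's duration.
    Factors close to 1.0 mean the unit needs little modification.
    """
    if unit_f0_hz <= 0 or unit_dur_s <= 0:
        raise ValueError("unit F0 and duration must be positive")
    pitch_factor = midi_to_hz(midi_note) / unit_f0_hz
    time_factor = note_dur_s / unit_dur_s
    return pitch_factor, time_factor


def to_semitones(pitch_factor):
    """Express a pitch scaling factor as a signed interval in semitones."""
    return 12.0 * math.log2(pitch_factor)


if __name__ == "__main__":
    # Hypothetical spoken vowel: 220 Hz, 100 ms long; target note: A4 held for 500 ms.
    pitch, time = scaling_factors(220.0, 0.1, 69, 0.5)
    print(f"pitch factor {pitch:.2f} ({to_semitones(pitch):+.1f} st), time factor {time:.2f}")
```

Under this reading, selecting units whose spoken F0 is already near the score pitch lowers the pitch factor, whereas the time factor stays large whenever a short spoken vowel has to fill a long sung note, which is consistent with the limitation reported above.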
The incorporation of expressiveness is also investigated in the 3D FEM-based numerical simulation of vowels through modifications of the glottal flow signals, following a source-filter approach to voice production. These signals are generated using a Liljencrants-Fant (LF) model controlled with the glottal shape parameter Rd, which allows exploring the tense-lax continuum of phonation as well as the spoken vocal range of fundamental frequency (F0) values. The contribution of the glottal source to higher order modes in the FEM synthesis of the cardinal vowels [a], [i] and [u] is analysed by comparing the High Frequency Energy (HFE) values obtained with realistic and simplified 3D geometries of the vocal tract. The simulations indicate that higher order modes are expected to be perceptually relevant according to reference values stated in the literature, particularly for tense phonations and/or high F0s. Conversely, vowels with a lax phonation and/or low F0s can result in inaudible HFE levels, especially if aspiration noise is not present in the glottal source. After this preliminary study, the excitation characteristics of happy and aggressive vowels from a Spanish parallel speech corpus are analysed with the aim of incorporating these tense-voice expressive styles into the numerical production of voice. To that effect, the GlottDNN vocoder is used to analyse the F0 and spectral tilt variations associated with the glottal excitation of [a] vowels. These variations are mapped, through comparison with synthetic vowels, into F0 and Rd values in order to simulate vowels resembling the happy and aggressive styles. Results show that it is necessary to increase F0 and decrease Rd with respect to neutral speech, with larger variations for the happy than for the aggressive style, especially for the stressed [a] vowels.
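As an illustration of how a single Rd value can drive an LF glottal pulse, the sketch below maps Rd to the LF shape ratios Ra, Rk and Rg and then to pulse timing instants for a given F0. It assumes the regression commonly attributed to Fant's 1995 revision of the LF model, usually quoted as valid for roughly 0.3 <= Rd <= 2.7; whether the thesis uses exactly this mapping is not stated in the text above, and the function names are hypothetical.

```python
def rd_to_lf_ratios(rd):
    """Map the glottal shape parameter Rd to LF ratios (Ra, Rk, Rg).

    Assumption: regression commonly attributed to Fant (1995); lower Rd
    corresponds to tenser phonation, higher Rd to laxer phonation.
    """
    if not 0.3 <= rd <= 2.7:
        raise ValueError("Rd outside the range usually quoted for this regression")
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    return ra, rk, rg


def ratios_to_rd(ra, rk, rg):
    """Inverse relation, used here only as a consistency check."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)


def lf_timing(rd, f0_hz):
    """Return LF timing instants in seconds for one glottal cycle.

    tp: instant of maximum glottal flow
    te: instant of maximum excitation (main negative peak of the flow derivative)
    ta: effective duration of the return phase
    """
    ra, rk, rg = rd_to_lf_ratios(rd)
    t0 = 1.0 / f0_hz           # fundamental period
    tp = t0 / (2.0 * rg)
    te = tp * (1.0 + rk)
    ta = ra * t0
    return {"t0": t0, "tp": tp, "te": te, "ta": ta}


if __name__ == "__main__":
    # Hypothetical comparison: a laxer setting at lower F0 vs a tenser one at higher F0.
    for rd, f0 in [(1.5, 110.0), (0.6, 180.0)]:
        print(rd, f0, lf_timing(rd, f0))
```

In this parameterisation, decreasing Rd shortens the return phase relative to the period, which is associated with a flatter source spectrum, so the reported direction of change for happy and aggressive vowels (higher F0, lower Rd) corresponds to moving toward the tense end of the continuum.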
The results achieved in the conducted investigations validate the possibility of adding expressiveness to both corpus-based US-TTS synthesis and FEM-based numerical simulation of voice. Nevertheless, there is still room for improvement. For instance, the strategy applied to the numerical voice production could be improved by studying and developing inverse filtering approaches as well as incorporating modifications of the vocal tract, whereas the developed US-TTS&S framework could benefit from advances in voice transformation techniques including voice quality modifications, taking advantage of the experience gained in the numerical simulation of expressive vowels.