New thesis dissertation in speech synthesis
Next April 27th, CTMedia researcher Lluís Formiga will present his PhD thesis, entitled "Perceptual optimization of speech synthesis systems based on unit selection through active interactive genetic algorithms". The supervisor of this thesis is Dr. Francesc Alías.
The presentation will take place in the Paraninf room, c/ Quatre Camins, 30, 08022 Barcelona, at 12h.
Abstract:
Text-to-Speech (TTS) systems produce synthetic speech from an input text. Unit Selection TTS (US-TTS) systems are based on the retrieval of the best sequence of speech units previously recorded into a database (corpus). The retrieval is done by means of a dynamic programming algorithm and a weighted cost function (Hunt and Black, 1996). An expert typically performs the weighting of the cost function by hand (Clark et al., 2007; Schröder et al., 2009). However, hand tuning is costly in terms of the prior training it requires and methodologically inaccurate (Strom and King, 2008). In order to properly tune the weights of the cost function, this thesis continues the perceptual tuning proposal put forward by Alías (2006), which uses active interactive Genetic Algorithms (aiGAs) (Llorà et al., 2005b). This thesis investigates the various problems that arise in applying aiGAs to the weight tuning of the cost function. Firstly, the thesis reviews in depth the state of the art in weight tuning. Afterwards, the thesis outlines the suitability of Interactive Evolutionary Computation (IEC) for weight tuning, with a thorough review of previous work (Alías, 2006). Then, the proposed improvements are presented. The four major guidelines pursued by this thesis are: accuracy in adjusting the weights, robustness of the weights obtained, the applicability of the methodology to any subcost distance, and the consensus of weights obtained by different users. In terms of accuracy, cluster-level perceptual tuning is proposed in order to obtain weights for different types (clusters) of units considering their phonetic and contextual properties. In terms of robustness of the evolutionary process, the thesis presents different metrics (indicators) to assess aspects such as the ambiguity within the evolutionary search, the convergence of one user or the level of consensus among different users.
Subsequently, to study the applicability of the proposed methodology, different weights are perceptually tuned combining linguistic and symbolic information. The last contribution of this thesis examines the suitability of latent models for modeling the preferences of different users and obtains a consensus solution. In addition, the experimentation is carried out on a medium-size corpus (1.9h), automatically labelled, in order to fill the gap between the proof-of-principle and a real unit selection scenario. The thesis concludes that aiGAs are highly competitive in comparison to other state-of-the-art weight tuning techniques.
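
For readers unfamiliar with unit selection, the search mentioned in the abstract (dynamic programming over a weighted cost function) can be illustrated with a minimal Python sketch. This is not the implementation used in the thesis: the function name `select_units`, the two weights `w_target` and `w_join`, and the toy cost functions are assumptions made for illustration only. A real US-TTS cost function combines many subcosts, each with its own weight, and those weights are what the thesis tunes perceptually.

```python
def select_units(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """Viterbi-style search for the cheapest sequence of units.

    candidates: one non-empty list of candidate units per target position
                (hypothetical representation, any object works as a unit).
    target_cost(i, unit): mismatch between a unit and target position i.
    join_cost(prev, unit): mismatch when concatenating two units.
    Returns (best_sequence, total_weighted_cost).
    """
    if not candidates:
        return [], 0.0

    # best[i][j] = (cheapest cumulative cost ending in candidate j, backpointer)
    best = [[(w_target * target_cost(0, u), -1) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            # Cheapest way to reach unit u from any unit at position i - 1.
            cost, back = min(
                (best[i - 1][k][0] + w_join * join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + w_target * target_cost(i, u), back))
        best.append(row)

    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

In this toy setting, changing `w_target` and `w_join` changes which sequence is selected, which is why the choice of weights has a direct effect on the synthetic speech that listeners hear.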