| Text-to-speech (TTS), one of the key technologies for the next generation of human-computer interaction, is an important research direction in the field of computer speech. The main problems of existing methods are the poor naturalness of synthesized speech, limited customizability, and limited application scenarios. This thesis addresses the problems of practical Chinese TTS in three parts. In the first part, a Chinese TTS system is proposed to address the poor speech naturalness of traditional methods and the unstable, slow synthesis of deep-learning-based systems. We explore the application of end-to-end speech synthesis to Chinese so that the synthesized speech better matches human pronunciation. An attention mechanism with constraints is proposed to improve synthesis stability. At the same time, we combine a neural network with the Griffin-Lim algorithm to increase the generation speed on CPU from 0.061 kHz to 21.837 kHz. The proposed Chinese TTS system can generate natural speech stably and quickly. In the second part, to compensate for the shortcoming of traditional voice-cloning methods, which require a large corpus, personalized TTS is explored for customized speech scenarios such as celebrity voice synthesis and real-time voice imitation. The two solutions proposed in this thesis can use less than 10 minutes of audio to quickly adapt the synthesized speech to the target speaker's accent. The solutions are fine-tuning the model and adding a voice conversion module. The former explores the application of transfer learning to TTS and is simple and effective. The latter uses semi-supervised learning to reduce the training difficulty of the voice conversion module. After the module is added to the original system, customized speech synthesis can be realized. In the third part, aiming at the problems that the input text is multilingual and that the model's ability to reproduce a
speaker's voice is weak, cross-lingual, multi-speaker speech synthesis is discussed. We study and design a cross-lingual text front-end. We then propose a speaker encoder network that efficiently extracts speaker features from input speech and generates a fixed-length speaker embedding vector. We also discuss how to fuse the speaker embedding vector with the acoustic feature generation network. The proposed model can generate speech for multiple speakers, including unseen speakers, in different languages, including Chinese. Different evaluation methods are used to verify the effectiveness of the proposed solutions. The proposed Chinese TTS system reaches a mean opinion score (MOS) of 4.048, higher than the 3.480 of Google's concatenative TTS and the 3.790 of Google's parametric TTS. The personalized TTS system reaches a MOS of 3.560. For cross-lingual, multi-speaker speech synthesis, the MOS is 3.762 for the naturalness of the synthesized speech and 3.418 for similarity. Compared with similar research, the proposed models have significant advantages in corpus requirements, speech naturalness, and similarity. |