Research On Key Technologies Of Personalized Speech Generation In Intelligent Home Environment

Posted on: 2016-10-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W X Gao
GTID: 1222330503456058
Subject: Control theory and control engineering
Abstract/Summary:
With the development of the Internet of Things (IoT), a variety of intelligent appliances, such as audio-visual equipment, lighting systems, security systems, automatic doors and other automatic control systems, and multi-function household robots, are connected through a home intelligent network platform to create a comfortable home environment. The smart home has greatly improved the convenience of people’s lives.

Voice control technology in the smart home network platform can greatly enhance the ease of interaction between people and intelligent appliances. As speech recognition technology improves, more and more intelligent home appliances support voice control. The latest research embeds a dialog system into the home networking platform; in particular, smart home devices can interact with users in a voice that has the same characteristics as a family member’s, i.e., personalized speech generation, to make the interaction more engaging. However, this poses a major challenge for speech research because the voice characteristics of family members differ.

Speech synthesis plays an important role in speech technology research. Personalized speech synthesis, in which a target speaker’s voice characteristics are synthesized from a small amount of that speaker’s recordings, is a new challenge for state-of-the-art speech synthesis systems, and research on it has great value for applications. Because training samples from target speakers are limited, speech synthesized by conventional personalized speech synthesis technologies in the smart home setting has many deficiencies, e.g., it sounds mechanical and muffled. Its low speaker similarity cannot meet the requirements of real applications.
Furthermore, it is almost impossible to synthesize speech across multiple languages or dialects.

The predominant technologies of personalized speech synthesis, proposed by Tokuda and Huang, are mainly based on Hidden Markov Models (HMMs). The inconsistency between the training and synthesis criteria can cause inaccurate modeling in speaker adaptation or voice conversion, and results in synthesized speech of low quality in terms of speaker similarity and naturalness. Real applications, especially smart homes, demand better synthesized voice quality as well as cross-dialect human-machine interaction, so that users experience more convenience and intimacy.

Based on the above, this dissertation studies improvements to technologies related to personalized speech generation, e.g., synthesis modeling, speaker adaptation, and voice conversion. The main research contents are as follows.

A source-filter mixed excitation model is employed to improve the naturalness and speaker similarity of synthesized speech. The methods of extracting periodicity ratios for the mixed excitation model are revised, and the periodicity ratios are modeled by HMMs in a slave manner, where the state boundaries are given by the spectral and pitch models. The experimental results confirm the effectiveness of these methods, i.e., the voice quality of speech synthesized with the mixed excitation model is significantly improved.

A frequency warping approach based on a time-varying bilinear function is proposed to reduce the weighted spectral distance between the source speaker and the target speaker, improving the accuracy of modeling in speaker adaptation.
The experimental results show that the frequency warping approach brings the warped spectra of the source speaker closer to those of the target speaker, and that the resulting adapted HMMs outperform HMMs trained on unwarped spectra in terms of synthesized speech naturalness and speaker similarity.

To meet the requirement of cross-dialect voice conversion in the intelligent home environment, we propose using a neural network for cross-dialect voice conversion after investigating the classic methods. Pre-training and sequence training based on speech perception are also applied to neural-network-based voice conversion. The experimental results, obtained on Mandarin and Shanghainese voice conversion, show that the approach is promising and worth further research.

The innovations and research results of this dissertation are as follows.

Firstly, two methods of extracting periodicity ratios for the mixed excitation model, the comb filter and the normalized correlation coefficient, are systematically compared. The experimental results with HMM-based speech synthesis show that the voice quality of speech synthesized with the mixed excitation model is significantly improved, and that the comb filter method of extracting periodicity ratios slightly outperforms the normalized correlation coefficient method.

Secondly, the proposed speaker adaptation method based on frequency warping significantly enhances the naturalness and similarity of the generated personalized speech. Compared with conventional methods, the proposed method has the following innovations.

1) Employing a criterion of minimizing the weighted log spectral distance between the source speaker and the target speaker, which is perceptually critical for voice characteristics, instead of using maximum likelihood (ML) to transform the source speaker’s features.
The resulting speech is perceptually more similar to the target speaker.

2) Performing a smooth transform over both the frequency and time domains with a bilinear warping function with frame-dependent warping factors, which preserves the time-varying nature of speech.

3) Retraining the source speaker’s HMMs to get a better initialization for further adaptation.

Thirdly, a method and training criterion of model learning for cross-dialect voice conversion is proposed for the first time. It realizes voice conversion across dialects, e.g., Mandarin and Shanghainese. Its main innovation lies in the following three aspects:

1) A language-independent frequency warping method is applied in cross-dialect voice conversion, greatly reducing the required amount of training data and the computational complexity.

2) Pre-training is employed in the training of the neural network. It drives the weights to a better starting point than random initialization and speeds up the convergence of the neural network training algorithm.

3) Sequence training, which mimics the perception of speech, minimizes sequence-level errors, and matches objectives, is proposed for training and conversion, and it largely improves the performance of cross-dialect voice conversion.

This dissertation contributes innovations and improvements to personalized speech synthesis technologies. It provides research ideas and references for the study of speech technology in the smart home environment.
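
The abstract gives no formulas for the frequency warping used in speaker adaptation, so the following Python sketch is only an illustration of the general idea, not the dissertation's implementation. It assumes the common first-order all-pass form of a bilinear warping function, a plain per-frame grid search over the warping factor (without the smoothing over time that the abstract mentions), and uniform placeholder weights in the log spectral distance. All function names, the toy spectra, and the grid of warping factors are hypothetical.

```python
import numpy as np


def bilinear_warp(omega, alpha):
    """Map frequencies in [0, pi] through a first-order all-pass (bilinear) warp.

    Assumed form: omega + 2 * atan(alpha * sin(omega) / (1 - alpha * cos(omega))),
    which is monotonic for |alpha| < 1 and is the identity for alpha = 0.
    """
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega), 1.0 - alpha * np.cos(omega))


def warp_log_spectrum(log_spec, alpha):
    """Resample one log-spectrum frame along the warped frequency axis."""
    omega = np.linspace(0.0, np.pi, len(log_spec))
    # The output bin at omega takes the source value at the warped frequency.
    return np.interp(bilinear_warp(omega, alpha), omega, log_spec)


def weighted_log_spectral_distance(a, b, weights):
    """Weighted root-mean-square difference between two log spectra."""
    return float(np.sqrt(np.sum(weights * (a - b) ** 2) / np.sum(weights)))


def best_alpha_per_frame(source, target, weights, alphas):
    """For each frame, pick the warping factor that minimizes the distance."""
    best = np.empty(len(source))
    for t, (src, tgt) in enumerate(zip(source, target)):
        dists = [
            weighted_log_spectral_distance(warp_log_spectrum(src, a), tgt, weights)
            for a in alphas
        ]
        best[t] = alphas[int(np.argmin(dists))]
    return best


# Toy data (hypothetical): two frames, each a single spectral bump.
omega = np.linspace(0.0, np.pi, 257)
source = np.stack([np.exp(-((omega - c) ** 2) / 0.05) for c in (0.8, 1.2)])
alphas = np.linspace(-0.2, 0.2, 9)
# Build the "target" by warping each source frame with a known factor.
target = np.stack([
    warp_log_spectrum(source[0], alphas[6]),
    warp_log_spectrum(source[1], alphas[2]),
])
weights = np.ones_like(omega)  # placeholder for a perceptual weighting
picked = best_alpha_per_frame(source, target, weights, alphas)
```

In this toy setup each target frame is the source frame warped by a known factor, so the grid search should recover one factor per frame. A perceptually motivated weighting and a constraint that keeps the factors smooth across frames would replace the uniform weights and the independent per-frame search.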
Keywords/Search Tags: Intelligent Home, Personalized Speech Generation, HMM, Cross-dialect Voice Conversion, Neural Network