Research And Implementation Of Mongolian Emotional Speech Synthesis System Based On Deep Learning

Posted on: 2022-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: A H Huang
GTID: 2518306788995029
Subject: Automation Technology
Abstract/Summary:
Speech synthesis is an important component of human-computer interaction: it converts text into the corresponding speech. With the rapid development of deep learning, many methods can now synthesize high-quality neutral speech. To make synthetic speech more human-like, a growing number of researchers are investigating how to synthesize emotionally expressive speech. In recent years, with the rapid development of intelligent information processing for the Mongolian language, end-to-end neutral Mongolian speech synthesis has reached a level suitable for practical application. Research on Mongolian emotional speech synthesis, however, is still at an early stage, and progress in this area is important for advancing the intelligent processing of the Mongolian language and script. The research content of this thesis is as follows:

1. A Mongolian emotional corpus is constructed. To address the scarcity of Mongolian emotional speech data, this thesis constructs a Mongolian emotional corpus containing 6.1 hours of female speech covering eight emotions (neutral, happy, angry, sad, surprised, fearful, disgusted, and sleepy), 1.3 hours of male speech covering happy and angry emotions, and 2.27 hours of children's speech covering happy and sad emotions.

2. A Mongolian emotional speech synthesis model based on reference speech is proposed. Starting from the end-to-end neutral Mongolian speech synthesis model, a reference encoder and a variational auto-encoder are introduced to extract latent variable information from the reference speech, including emotion, speech rate, and tone, which yields a controllable Mongolian emotional speech synthesis model. Training uses transfer learning. First, the model is pre-trained on a large amount of neutral Mongolian speech from non-target speakers. Then, the pre-trained model is fine-tuned on the emotional speech of the target speaker to obtain an emotionally controllable Mongolian emotional speech synthesis model. In the experiments, the MOS values of the Mongolian emotional speech synthesized with the female, male, and child voices were 3.70, 3.56, and 3.73, respectively, which indicates that the method can synthesize speech with different emotions.

3. A Mongolian emotional speech synthesis system is built. The system uses a Client/Server (C/S) architecture, and the reference-speech-based Mongolian emotional speech synthesis model is deployed in it so that users can synthesize Mongolian speech with different emotions according to their needs. The synthesis service is built with the Flask framework, and the client is designed and implemented for the Android system.
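As a rough illustration of the variational auto-encoder step mentioned in the abstract, the sketch below shows the standard reparameterization trick and the KL term for a diagonal Gaussian posterior. It is not the thesis's implementation: the 4-dimensional latent, the example values of `mu` and `log_var`, and the function names are all hypothetical, and in a real model these parameters would be predicted from the reference encoder's output.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps per dimension, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Hypothetical posterior parameters; a real system would obtain them
# from a network applied to the reference speech embedding.
mu = [0.3, -1.2, 0.0, 0.8]
log_var = [-0.5, 0.1, 0.0, -1.0]

rng = random.Random(0)          # fixed seed so the sample is repeatable
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Here `z` would play the role of the latent style vector passed to the synthesizer, and `kl` is the regularization term usually added to the reconstruction loss during training.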
Keywords/Search Tags:Mongolian speech synthesis, end-to-end, reference voice, Mongolian sentiment speech synthesis, variational autoencoder