
Research On Speech Synthesis Method With Emotion Embedding

Posted on: 2024-08-30 | Degree: Master | Type: Thesis
Country: China | Candidate: J T Peng | Full Text: PDF
GTID: 2568307130453114 | Subject: Computer technology

Abstract/Summary:
Speech synthesis is the technique of converting text to speech. It draws on acoustics, linguistics, digital signal processing, computer science, natural language processing, and statistical learning, and plays an important role in human-computer interaction. With the rapid development of deep learning, speech synthesis technology has attracted increasing attention. While most work still focuses on the naturalness and expressiveness of synthesized speech, emotion is an indispensable part of everyday verbal communication, so emotional speech synthesis has become a new research hotspot, and more and more researchers are investigating how to synthesize high-quality speech with emotional expressiveness. In emotional speech synthesis, the dominant approach is to obtain style embeddings from reference speech, but this approach tends to learn an averaged style representation and fails to synthesize strongly emotional speech. In addition, few open-source datasets are suitable for emotional speech synthesis, so insufficient training data often leads to overfitting and limits the development of the field. To address these problems, this thesis investigates speech synthesis methods with emotion embedding. The main work is as follows.

(1) An emotional speech synthesis method based on a conditional variational autoencoder is proposed. The method uses a conditional variational autoencoder to disentangle the emotional representation from the reference audio and separate it from the latent space as a conditional factor, while the latent space encodes the remaining style information (e.g. tone, speaking rate, intonation) and is constrained to follow a standard normal distribution; the resulting representation is then embedded into the speech synthesis model (see the first sketch after this abstract). This resolves the problem that the emotion information in the reference speech is entangled with other paralinguistic information, which makes salient emotional features difficult to extract. A duration prediction module is also designed to align the sequence lengths of phonemes and spectrograms. Experiments on the public ESD dataset show that the proposed method outperforms the mainstream baselines VAE-Tacotron and GST-Tacotron: it achieves the lowest MCD (Mel Cepstral Distortion) on all emotions and the lowest MSD (Mel Spectral Distortion) and FFE (F0 Frame Error) on average. In subjective evaluation it obtains the highest MOS (Mean Opinion Score) and receives the most choices in the A/B preference test.

(2) A speech synthesis method based on emotion transfer is proposed. The method applies transfer learning: it extracts a speech representation from the reference speech by fine-tuning a pre-trained speech model and, combined with an emotion classifier, maps it into the emotion space. The result is embedded into the speech synthesis model to generate emotional speech, alleviating the performance degradation caused by insufficient training data. Mutual information neural estimation is also used to remove linguistic information from the emotion embedding, addressing the problem of content leakage (see the second sketch after this abstract). Experiments on the open-source low-resource emotion dataset EmoV-DB show that the proposed method outperforms other transfer-learning methods in both subjective and objective evaluations: it achieves the lowest MCD and MSD on all emotions, the lowest average FFE, the highest MOS score, and the most choices in the A/B preference test.

(3) A prototype speech synthesis system with emotion embedding is designed and implemented. Matlab was used to build the system interface, and the PyTorch deep learning framework and the Python language were used to implement the algorithms. The prototype consists of four main modules: an emotional speech dataset upload module, a model training module, a speech synthesis module, and a speech playback and visualization module. Both the proposed conditional-variational-autoencoder method and the transfer-learning method are implemented in the system, which demonstrates and validates the rationality and effectiveness of the proposed methods.
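Sketch 1. To make the conditional-VAE idea in (1) concrete, the following is a minimal PyTorch sketch, not the thesis's actual architecture: all module names, dimensions, and the single-GRU reference encoder are illustrative assumptions. The emotion label enters as an explicit conditional factor, while a residual latent z, regularized toward a standard normal by the KL term, absorbs the other style information (tone, speaking rate, intonation).

import torch
import torch.nn as nn

class ConditionalStyleVAE(nn.Module):
    # Hypothetical sketch: encodes a reference mel spectrogram into a
    # residual style latent z, conditions on an emotion label, and emits
    # a style embedding for a downstream TTS decoder.
    def __init__(self, n_mels=80, n_emotions=5, latent_dim=16, hidden=128):
        super().__init__()
        self.ref_encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # The emotion label is a separate conditional factor, so z only
        # has to model the remaining paralinguistic style.
        self.emotion_embed = nn.Embedding(n_emotions, latent_dim)
        self.to_style = nn.Linear(2 * latent_dim, hidden)

    def forward(self, ref_mel, emotion_id):
        # ref_mel: (B, T, n_mels); emotion_id: (B,) long tensor
        _, h = self.ref_encoder(ref_mel)          # h: (1, B, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        e = self.emotion_embed(emotion_id)
        style = self.to_style(torch.cat([z, e], dim=-1))
        # KL(q(z|x) || N(0, I)) keeps the residual latent near standard normal.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return style, kl

One consequence of this factorization: at synthesis time emotion_id can be set directly (and z sampled from N(0, I)), so the emotion is controlled explicitly instead of being averaged out of a reference embedding.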
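Sketch 2. The content-leakage penalty in (2) relies on mutual information neural estimation (MINE, Belghazi et al., 2018). Below is a minimal sketch under assumed names and dimensions, not the thesis's exact implementation: a statistics network T estimates the Donsker-Varadhan lower bound on the mutual information between emotion embeddings and text/content embeddings, using shuffled pairs as marginal samples.

import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    # Hypothetical statistics network T(emo, txt) for MI estimation.
    def __init__(self, emo_dim=128, txt_dim=256, hidden=256):
        super().__init__()
        self.T = nn.Sequential(
            nn.Linear(emo_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def mi_lower_bound(self, emo, txt):
        # Joint term: matched (emotion, content) pairs from the batch.
        joint = self.T(torch.cat([emo, txt], dim=-1)).mean()
        # Marginal term: break the pairing by shuffling content embeddings.
        perm = torch.randperm(txt.size(0))
        marginal = torch.logsumexp(
            self.T(torch.cat([emo, txt[perm]], dim=-1)), dim=0
        ).squeeze() - math.log(txt.size(0))
        return joint - marginal  # Donsker-Varadhan estimate of I(emo; txt)

Training is adversarial: the statistics network is updated to maximize this bound, while the synthesis model adds the estimate as a penalty and minimizes it, pushing the emotion embedding to carry no linguistic content.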
Keywords/Search Tags:emotional speech synthesis, deep learning, conditional variational autoencoder, transfer learning