
Research And Implementation Of End-to-End Prosodic Speech Synthesis System

Posted on: 2022-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: W H Niu  Full Text: PDF
GTID: 2518306341450674  Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the mobile Internet and continuous improvements in computing power, speech technology has achieved a series of breakthroughs. Application scenarios for speech synthesis are expanding, from mobile voice assistants such as Siri and Xiao Ai to smart speakers. To meet diverse user needs and deliver a better user experience, the market demands synthesized speech that is rich in timbre, close to a real human voice, and full of emotion. However, current speech synthesis is mostly prosodically monotonous, and it is difficult to synthesize different intonations for different input texts. In zero-shot multi-speaker speech synthesis, there are also problems such as inaccurate identity encodings for unseen speakers. This thesis focuses on the following two aspects.

First, the MAI (Multiple Acoustic Information) module and the SA-M (Multiple Self-Attention) module are proposed. When a reference audio is available, the MAI module extracts pitch and loudness information from it; the SA-M module uses multiple layers of self-attention to mine the latent syntactic and semantic information in the input text, and applies iterative aggregation to combine information across the different levels. Building on these modules, this thesis proposes MAI-SA (Multiple Acoustic Information and Self-Attention), an expressive prosodic speech synthesis model based on Tacotron2. Comparative experiments on each module, evaluated both subjectively and objectively, show better results than the widely used Tacotron2 and Tacotron2-GST models.

Second, the MSA-LDE (Multi-Scale Aggregation-LDE) speaker encoding module is proposed, composed of a ResNet34 backbone and an LDE module. Drawing on the idea of pyramid networks from the image domain, it extracts features at different stages of ResNet34 and then uses LDE to capture speaker information at different scales. Applied to the MAI-SA model, this module synthesizes the identity features of unseen speakers better than the commonly used scheme of training the speaker encoder with a GE2E loss.

Finally, this thesis designs a simple and easy-to-use speech synthesis platform that integrates the expressive speech synthesis model MAI-SA and the speaker speech synthesis model MSA-LDE, and verifies the platform's usability through experiments.
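The abstract does not specify how SA-M's iterative aggregation is implemented. The following is only a minimal numpy sketch of the general idea of stacking self-attention layers over token embeddings and folding each layer's output into a running aggregate, so that shallow (more syntactic) and deep (more semantic) levels both contribute; the function names, toy sizes, single-head attention, and averaging-based aggregation rule are all illustrative assumptions, not the thesis's actual design.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def stacked_attention_with_aggregation(x, layers):
    """Run several self-attention layers and iteratively aggregate their outputs:
    after each layer, average its output with the running aggregate, so every
    level of the stack is represented in the final features."""
    agg = x
    h = x
    for (w_q, w_k, w_v) in layers:
        h = self_attention(h, w_q, w_k, w_v)
        agg = 0.5 * (agg + h)   # iterative aggregation across levels (illustrative rule)
    return agg

rng = np.random.default_rng(0)
T, d = 6, 8   # 6 text tokens, 8-dim embeddings (toy sizes)
x = rng.standard_normal((T, d))
layers = [tuple(rng.standard_normal((d, d)) for _ in range(3)) for _ in range(3)]
out = stacked_attention_with_aggregation(x, layers)
print(out.shape)  # (6, 8): one enriched vector per input token
```

In an actual acoustic model these aggregated token features would feed the Tacotron2-style encoder-decoder, rather than being used directly.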
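Likewise, the pyramid-style idea behind MSA-LDE, namely pooling speaker statistics from every stage of the backbone instead of only the last, can be sketched as below. This is not the thesis's implementation: the real module uses ResNet34 stages and LDE (learnable dictionary encoding) pooling, whereas here each stage is a stand-in linear layer with downsampling and LDE is replaced by simple mean-and-standard-deviation pooling purely for illustration.

```python
import numpy as np

def stage(x, w):
    """Stand-in for one backbone stage: linear map + ReLU, then halve the time axis."""
    h = np.maximum(x @ w, 0.0)
    return h[::2]                      # temporal downsampling

def stats_pool(h):
    """Utterance-level pooling (mean + std) over frames. A stand-in for LDE,
    which would instead soft-assign frames to a learned dictionary of components."""
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])

def multi_scale_embedding(x, stage_weights):
    """Pyramid-style multi-scale aggregation: pool the output of every stage,
    not just the last one, and concatenate into a single speaker embedding."""
    pooled = []
    h = x
    for w in stage_weights:
        h = stage(h, w)
        pooled.append(stats_pool(h))
    return np.concatenate(pooled)

rng = np.random.default_rng(1)
frames = rng.standard_normal((64, 16))                  # 64 frames of 16-dim features
ws = [rng.standard_normal((16, 16)) for _ in range(4)]  # 4 stages, as in ResNet34
emb = multi_scale_embedding(frames, ws)
print(emb.shape)  # (128,) = 4 stages x (16-dim mean + 16-dim std)
```

The design point is that early stages retain fine-grained, high-frame-rate cues while late stages carry abstract ones; concatenating pooled statistics from all stages lets the speaker embedding use both, which is what helps with unseen speakers.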
Keywords/Search Tags:expressive speech synthesis, acoustic information, self-attention mechanism, zero-shot