
Research And Implementation Of End-to-End Prosodic Speech Synthesis System

Posted on: 2022-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: W H Niu  Full Text: PDF
GTID: 2518306341450674  Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the mobile Internet and continuous improvements in computing power, speech technology has achieved a series of breakthroughs. Application scenarios for speech synthesis are expanding, from mobile voice assistants such as Siri and Xiao Ai to smart speakers. To meet diverse user needs and deliver a better user experience, the market demands synthesized speech that is rich in timbre, close to a real human voice, and full of emotion. However, current speech synthesis is mostly prosodically monotonous, and it is difficult to synthesize different intonations for different input texts. In zero-shot multi-speaker speech synthesis, there are also problems such as inaccurate identity encodings for unseen speakers. This thesis focuses on the following two aspects.

First, the MAI (Multiple Acoustic Information) module and the SA-M (Multiple Self-Attention) module are proposed. When a reference audio is available, the MAI module extracts pitch and loudness information from it; the SA-M module uses multiple layers of self-attention to mine the latent syntactic and semantic information in the input text, and applies iterative aggregation to combine information across the different levels. Building on these modules, this thesis proposes MAI-SA (Multiple Acoustic Information and Self-Attention), an expressive prosodic speech synthesis model based on Tacotron2. Comparative experiments on each module, evaluated both subjectively and objectively, show better results than the widely used Tacotron2 and Tacotron2-GST models.

Second, the MSA-LDE (Multi-Scale Aggregation-LDE) speaker encoding module is proposed, composed of a ResNet34 backbone and an LDE module. Drawing on the idea of pyramid networks from the image domain, it extracts features at different stages of ResNet34 and then uses LDE to capture speaker information at different scales. Applied to the MAI-SA model, this module synthesizes the identity features of unseen speakers better than the commonly used scheme of training the speaker encoder with a GE2E loss.

Finally, this thesis designs a simple and easy-to-use speech synthesis platform that integrates the expressive speech synthesis model MAI-SA and the speaker speech synthesis model MSA-LDE, and verifies the platform's usability through experiments.
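The abstract does not specify how SA-M's iterative aggregation is implemented. The following is only a minimal numpy sketch of the general idea of stacking self-attention layers over token embeddings and folding each layer's output into a running aggregate, so that shallow (more syntactic) and deep (more semantic) levels both contribute; the function names, toy sizes, single-head attention, and averaging-based aggregation rule are all illustrative assumptions, not the thesis's actual design.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def stacked_attention_with_aggregation(x, layers):
    """Run several self-attention layers and iteratively aggregate their outputs:
    after each layer, average its output with the running aggregate, so every
    level of the stack is represented in the final features."""
    agg = x
    h = x
    for (w_q, w_k, w_v) in layers:
        h = self_attention(h, w_q, w_k, w_v)
        agg = 0.5 * (agg + h)   # iterative aggregation across levels (illustrative rule)
    return agg

rng = np.random.default_rng(0)
T, d = 6, 8   # 6 text tokens, 8-dim embeddings (toy sizes)
x = rng.standard_normal((T, d))
layers = [tuple(rng.standard_normal((d, d)) for _ in range(3)) for _ in range(3)]
out = stacked_attention_with_aggregation(x, layers)
print(out.shape)  # (6, 8): one enriched vector per input token
```

In an actual acoustic model these aggregated token features would feed the Tacotron2-style encoder-decoder, rather than being used directly.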
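Likewise, the pyramid-style idea behind MSA-LDE, namely pooling speaker statistics from every stage of the backbone instead of only the last, can be sketched as below. This is not the thesis's implementation: the real module uses ResNet34 stages and LDE (learnable dictionary encoding) pooling, whereas here each stage is a stand-in linear layer with downsampling and LDE is replaced by simple mean-and-standard-deviation pooling purely for illustration.

```python
import numpy as np

def stage(x, w):
    """Stand-in for one backbone stage: linear map + ReLU, then halve the time axis."""
    h = np.maximum(x @ w, 0.0)
    return h[::2]                      # temporal downsampling

def stats_pool(h):
    """Utterance-level pooling (mean + std) over frames. A stand-in for LDE,
    which would instead soft-assign frames to a learned dictionary of components."""
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])

def multi_scale_embedding(x, stage_weights):
    """Pyramid-style multi-scale aggregation: pool the output of every stage,
    not just the last one, and concatenate into a single speaker embedding."""
    pooled = []
    h = x
    for w in stage_weights:
        h = stage(h, w)
        pooled.append(stats_pool(h))
    return np.concatenate(pooled)

rng = np.random.default_rng(1)
frames = rng.standard_normal((64, 16))                  # 64 frames of 16-dim features
ws = [rng.standard_normal((16, 16)) for _ in range(4)]  # 4 stages, as in ResNet34
emb = multi_scale_embedding(frames, ws)
print(emb.shape)  # (128,) = 4 stages x (16-dim mean + 16-dim std)
```

The design point is that early stages retain fine-grained, high-frame-rate cues while late stages carry abstract ones; concatenating pooled statistics from all stages lets the speaker embedding use both, which is what helps with unseen speakers.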
Keywords/Search Tags:expressive speech synthesis, acoustic information, self-attention mechanism, zero-shot