Design And Implementation Of End-to-End Indonesian Speech Synthesis System

Posted on: 2022-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: L X Zhao
GTID: 2545307037985659
Subject: Electronic and communication engineering
Abstract/Summary:
In recent years, a steady stream of advances in deep learning has driven improvements in speech synthesis, and synthesis systems now produce speech that increasingly resembles that of real people. Indonesian is a less commonly studied language: compared with widely studied languages such as Chinese and English, its speech synthesis technology is underdeveloped. Improving the performance of Indonesian speech synthesis with more advanced techniques therefore remains a research focus. This thesis is devoted to developing an end-to-end Indonesian speech synthesis system, and the following research is carried out.

First, an end-to-end Indonesian baseline speech synthesis system is designed and implemented. At the same time, we proofread the existing Indonesian corpus so that it meets the input requirements of the end-to-end system. We also add a constraint term to the attention mechanism of the system to speed up convergence and improve stability. The performance of the baseline system is evaluated with Mel Cepstral Distortion, Gross Pitch Error and F0 Frame Error.

Second, to address the lack of training data for low-resource languages, the thesis implements an Indonesian speech synthesis system based on the BERT pre-trained language model, and in doing so explores how end-to-end speech synthesis might be improved for other low-resource languages. Two methods, contextual information concatenation and word vector concatenation, are used to embed BERT pre-trained word vector information into the speech synthesis system. We also compare the effect of different encoders on system performance.

Third, a global style token (GST) module is added to the synthesis system to enhance the naturalness of the synthesized Indonesian speech. Because the GST model can learn the prosodic features of input audio, it can be used to extract those features, which then serve as an additional input to the system to improve the quality of the synthesized speech. In addition, we propose two methods for predicting prosodic features from the input text, so that in practical use the naturalness of synthesized speech can be improved from text alone.

Finally, three objective metrics, Mel Cepstral Distortion, Gross Pitch Error and F0 Frame Error, are used to evaluate the speech synthesized by the systems in this thesis. Mean Opinion Score, attention alignment plots and Mel spectrograms of the synthesized speech are also used to give a more comprehensive evaluation. The experimental results show that both the BERT-based Indonesian speech synthesis system and the GST-based Indonesian speech synthesis system proposed in this thesis outperform the end-to-end Indonesian baseline system in all aspects.
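The three objective metrics named in the abstract can be sketched as follows. This is a minimal illustration using their commonly cited definitions, not the thesis's own evaluation code. It assumes that the reference and synthesized frames are already time-aligned (no dynamic time warping is applied), that unvoiced frames are marked with an F0 of 0, and that a gross pitch error means a deviation of more than 20% from the reference F0. The function names and the `tol` parameter are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB over frames.

    Inputs have shape (frames, coeffs) and are assumed time-aligned.
    The 0th coefficient (energy) is excluded, as is common practice.
    """
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))


def _gross_error_mask(ref_f0, syn_f0, tol):
    """Return (frames voiced in both signals, frames with a gross pitch error)."""
    both_voiced = (ref_f0 > 0) & (syn_f0 > 0)
    gross = both_voiced & (np.abs(syn_f0 - ref_f0) > tol * ref_f0)
    return both_voiced, gross


def gross_pitch_error(ref_f0, syn_f0, tol=0.2):
    """Fraction of frames voiced in both signals whose F0 deviates by more than tol."""
    both_voiced, gross = _gross_error_mask(ref_f0, syn_f0, tol)
    n_voiced = int(both_voiced.sum())
    return float(gross.sum() / n_voiced) if n_voiced else 0.0


def f0_frame_error(ref_f0, syn_f0, tol=0.2):
    """Fraction of all frames with a gross pitch error or a voicing decision error."""
    _, gross = _gross_error_mask(ref_f0, syn_f0, tol)
    voicing_error = (ref_f0 > 0) != (syn_f0 > 0)
    return float((gross.sum() + voicing_error.sum()) / len(ref_f0))


# Toy F0 tracks in Hz; 0 marks an unvoiced frame (assumed convention).
ref_f0 = np.array([100.0, 100.0, 0.0, 100.0])
syn_f0 = np.array([100.0, 130.0, 50.0, 0.0])
gpe = gross_pitch_error(ref_f0, syn_f0)
ffe = f0_frame_error(ref_f0, syn_f0)
```

For all three metrics, lower values indicate synthesized speech that is closer to the reference recording. In practice the two utterances usually differ in length, so an alignment step would be needed before these functions are applied.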
Keywords/Search Tags: Indonesian, Speech Synthesis, End-to-End, Pre-trained Language Model, Global Style Token