Research On Indonesian Speech Synthesis Based On End-to-End Neural Network Models

Posted on: 2023-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lu
GTID: 2545306617482844
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, the rapid development of deep learning has opened up a new research direction in speech synthesis. Researchers worldwide have adopted end-to-end speech synthesis methods based on deep neural networks and achieved strong results. These end-to-end models clearly outperform traditional statistical parametric synthesis methods: an end-to-end model converts the input text sequence directly into the corresponding speech spectrum sequence through a series of neural networks. In end-to-end speech synthesis studies targeting English, Chinese, and other widely used languages, the synthesized speech is already very close to natural human speech. Indonesian, however, is not a widely used language and its electronic corpora are scarce, so research on Indonesian speech synthesis has progressed much more slowly than research on Chinese, English, and other widely used languages.

Starting from the text characteristics of Indonesian and taking Tacotron2 as the baseline model, this thesis implements an Indonesian speech synthesis system based on an end-to-end neural network model. On this basis, it explores improvements that address the low-resource nature of Indonesian and the shortcomings of the Tacotron2 model. The main work of this thesis is as follows:

(1) An Indonesian speech database is constructed, and the conversion from word sequences to phoneme sequences is designed and implemented according to the characteristics of Indonesian. An end-to-end Indonesian speech synthesis baseline system based on the Tacotron2 model is designed and implemented, and the shortcomings of the baseline model are analyzed.

(2) On the basis of the baseline model, the model hyperparameters are optimized, an attention constraint method is introduced to make more efficient use of the data during training, and a semi-supervised training scheme for low-resource languages is used. With the encoder frozen, the decoder is pre-trained on Indonesian speech that has no text transcription, so that it acquires some acoustic representation ability before formal training, which reduces the amount of corpus data needed for model training. The MOS score of the optimized model reaches 3.84, and the intelligibility and naturalness of the synthetic speech are improved.

(3) To address the exposure bias of the Tacotron2 decoder and the low-resource nature of Indonesian, this thesis pre-trains the end-to-end speech synthesis model on an English corpus and then uses the Indonesian corpus to transfer the model parameters. On this basis, to overcome the effect of cumulative errors when synthesizing long sentences, this thesis proposes an "alternating training" method with adjustable probability and studies the optimal alternating training scheme. The MOS score of the end-to-end Indonesian speech synthesis model trained with this scheme reaches 3.95. Very long sentences can be synthesized completely and without distortion, which effectively reduces the cumulative error caused by exposure bias and achieves the desired results.

Together, the two optimization schemes proposed and implemented in this thesis address the shortcomings of the Indonesian baseline model: they accelerate the convergence of model training, reduce the amount of speech data required for training, and improve the quality of the synthetic speech. High-quality Indonesian speech can be synthesized under low-resource conditions.
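The abstract does not spell out how the "alternating training" method works. As a rough illustration only, the sketch below shows one plausible reading, similar in spirit to scheduled sampling: at each decoder step the next input is either the ground-truth frame (teacher forcing) or the model's own previous output (free running), chosen with an adjustable probability. The names `decode_with_alternation`, `p_free`, and `toy_step`, and the use of scalar "frames", are assumptions made for this example and do not come from the thesis.

```python
import random
from typing import Callable, List, Tuple


def decode_with_alternation(
    targets: List[float],
    step: Callable[[float], float],
    p_free: float,
    rng: random.Random,
    go_frame: float = 0.0,
) -> Tuple[List[float], List[bool]]:
    """Run a toy autoregressive decoder over len(targets) steps.

    After each step, the next decoder input is the model's own output with
    probability p_free (free running), otherwise the ground-truth frame
    (teacher forcing). Returns the outputs and the per-step choice.
    This is a hypothetical illustration, not the thesis's procedure.
    """
    outputs: List[float] = []
    used_free: List[bool] = []
    prev_input = go_frame
    for target in targets:
        out = step(prev_input)          # predict the current frame
        outputs.append(out)
        free = rng.random() < p_free    # adjustable alternation probability
        used_free.append(free)
        prev_input = out if free else target
    return outputs, used_free


def toy_step(prev: float) -> float:
    """Stand-in for one decoder step (assumption: a fixed affine map)."""
    return 0.5 * prev + 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    outs, choices = decode_with_alternation([1.0, 2.0, 3.0], toy_step, 0.5, rng)
    print(outs, choices)
```

Setting `p_free` to 0 gives pure teacher forcing and setting it to 1 gives pure free running; intermediate values expose the decoder to its own errors during training, which is the general idea behind reducing exposure bias.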
Keywords/Search Tags:Indonesian, End-to-End speech synthesis, Semi-supervised training, Alternate training