
Investigating The Key Problems In Deep Learning Based Acoustic Modeling For Speech Synthesis

Posted on: 2021-09-13  Degree: Doctor  Type: Dissertation
Country: China  Candidate: S Yang  Full Text: PDF
GTID: 1528307316995899  Subject: Computer Science and Technology
Abstract/Summary:
As the final link in the chain of human-computer speech interaction, speech synthesis aims to produce natural speech responses, and its performance directly affects the naturalness of the interaction. The acoustic model, the key component of a speech synthesis system, models the complicated relations between text and speech. With the rapid development of deep learning, deep neural network (DNN) based acoustic models have outperformed hidden Markov model (HMM) based ones by a large margin. Building on DNNs, this thesis focuses on several key aspects of acoustic modeling, including naturalness, speaker adaptation, pronunciation stability, and robustness. The main contributions and novelties of this thesis are summarized as follows:

1) This thesis investigates the average acoustic model with a continuous speaker representation, the i-vector, and verifies the impact of the speaker distribution and the speaker representation space on acoustic modeling performance. A multi-speaker acoustic model with discrete speaker labels requires a suitable set of data to fine-tune model parameters before it can produce a new speaker's voice. With a continuous speaker representation, a unified average acoustic model can instead model the relations among speaker identities and produce a target speaker's voice from only one sentence. Based on the i-vector based average model, this thesis investigates the impact of the speaker distribution and the size of the continuous speaker space on acoustic modeling. Experimental results show that a speaker-dependent distribution improves speech similarity during adaptation, and that a low-dimensional i-vector represents speaker identity well for speech synthesis.

2) This thesis proposes a multi-task framework with a generative adversarial network (GAN) to improve the naturalness of synthesized speech. Conventional acoustic models are typically trained with a mean square error (MSE) criterion, which considers only the numerical difference between the natural audio and the synthesized audio. This thesis adopts a GAN to model the speech distribution and combines it with MSE to stabilize the adversarial training process. Compared with a conventional MSE-based acoustic model, the proposed method improves the naturalness of synthesized speech.

3) This thesis proposes a hybrid self-attention structure with a relative-position-aware bias to improve the naturalness of speech synthesis. Since recurrent neural network (RNN) based sequence-to-sequence (seq2seq) acoustic models struggle to model long-term contexts, this thesis introduces a modified self-attention with a relative-position-aware method to model global contexts. To retain the sequential modeling advantage of RNNs, the RNNs are further combined with self-attention to form a hybrid acoustic model. Experimental results show that the proposed model significantly improves the naturalness of synthesized speech.

4) This thesis proposes a novel self-attention with a learnable Gaussian bias to model localness for speech synthesis, which improves the stability of seq2seq-based acoustic models. Although a self-attention based acoustic model can capture global contexts, it ignores the contribution of local information to pronunciation, which may lead to pronunciation errors during synthesis. Exploiting the monotonic nature of speech generation, this thesis adds a learnable Gaussian bias to the global context modeling to dynamically enhance local information at each time step, which greatly improves the stability of the acoustic model.

5) This thesis proposes an acoustic model based on adversarial feature learning and unsupervised clustering to model found data containing acoustic and textual noise. Current seq2seq models require high-quality speech data for acoustic modeling; this thesis investigates how to build a robust acoustic model from noisy found data. For textual noise arising from speech recognition errors, an unsupervised clustering method is proposed to learn phonetic-like information from speech to compensate for the erroneous linguistic features. In addition, data augmentation and adversarial feature learning are proposed to deal with acoustic noise. With noisy found data from the target speaker, the proposed model can synthesize clean, high-quality speech close to that of a system built on clean data. For speaker adaptation with a small amount of noisy data, this thesis uses adversarial feature learning to obtain a robust multi-speaker model, and further introduces two adaptation frameworks, with adaptive training and with continuous speaker representation, to produce a clean target voice.
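The multi-task training in contribution 2) combines a numerical MSE term with an adversarial term. The following is a minimal numpy sketch of such a combined objective; the function names, the non-saturating form of the generator loss, and the weight `adv_weight` are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def mse_loss(pred, target):
    # Numerical distance between predicted and natural acoustic features.
    return np.mean((pred - target) ** 2)

def adversarial_loss(disc_score_on_fake):
    # Non-saturating generator loss (an assumed choice): push the
    # discriminator's scores on synthesized frames toward "real" (1.0).
    eps = 1e-8
    return -np.mean(np.log(disc_score_on_fake + eps))

def combined_loss(pred, target, disc_score_on_fake, adv_weight=0.1):
    # MSE anchors and stabilizes training; a small adversarial term
    # encourages the output distribution to match natural speech.
    return mse_loss(pred, target) + adv_weight * adversarial_loss(disc_score_on_fake)
```

In this sketch, keeping `adv_weight` small reflects the abstract's point that MSE is retained to stabilize the adversarial training process.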
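Contribution 3)'s relative-position-aware self-attention can be sketched as a learned bias on the attention logits, indexed by the clipped distance between query and key positions. This is a minimal single-head numpy sketch under assumed conventions (clipping window `max_dist`, bias computed as a dot product between queries and learned relative embeddings); the thesis's actual layer may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(q, k, v, rel_emb, max_dist):
    # q, k, v: (T, d) single-head inputs.
    # rel_emb: (2*max_dist + 1, d) learned relative-position embeddings.
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    # Relative distance j - i, clipped to [-max_dist, max_dist], shifted to >= 0.
    idx = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                  -max_dist, max_dist) + max_dist
    # bias[i, j] = q_i . rel_emb[idx[i, j]]: position-dependent logit offset.
    bias = np.einsum('id,ijd->ij', q, rel_emb[idx])
    return softmax(logits + bias) @ v
```

With zero relative embeddings this reduces to plain scaled dot-product attention, which makes the role of the bias easy to isolate.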
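The localness idea in contribution 4) amounts to adding a Gaussian-shaped bias, centered on a predicted position, to the attention logits, so each decoding step emphasizes a local window while still seeing global context. A minimal numpy sketch follows; here `center` and `width` are passed in as arrays, whereas in the thesis they would be learnable quantities predicted by the network, and the exact parameterization is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_biased_attention(q, k, v, center, width):
    # q: (T, d) queries; k, v: (S, d) keys/values.
    # center, width: (T,) per-step Gaussian parameters (learnable in practice).
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    pos = np.arange(k.shape[0])
    # bias[i, j] = -(j - center_i)^2 / (2 * width_i^2): zero at the predicted
    # center and increasingly negative away from it, enhancing local frames.
    bias = -((pos[None, :] - center[:, None]) ** 2) / (2 * width[:, None] ** 2)
    return softmax(logits + bias) @ v
```

A narrow `width` makes each step nearly hard-aligned to its center, which mirrors the abstract's appeal to the monotonic nature of speech generation.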
Keywords/Search Tags: Speech synthesis, acoustic model, deep learning, adversarial learning, end-to-end speech synthesis, sequence-to-sequence model