
TTS Realization Of Mel Spectrogram Prediction Method Based On Deep Learning

Posted on: 2021-05-02    Degree: Master    Type: Thesis
Country: China    Candidate: Y N Liu    Full Text: PDF
GTID: 2438330602497668    Subject: Electronics and Communications Engineering
Abstract/Summary:
Speech synthesis is a technology that produces artificial speech by mechanical, electronic, and other means, and it has reached new heights in today's wave of artificial-intelligence development. Among the three levels of speech synthesis, Text-To-Speech (TTS) is the main direction of modern research: it takes written text as input and produces a human-sounding voice through a series of processing steps. Waveform concatenation and statistical parametric synthesis are the two methods that dominated this field before the rise of deep-learning algorithms. The waveform concatenation method analyzes the text and extracts its prosodic features, selects suitable speech unit fragments from a corpus according to the synthesis target, and concatenates the selected unit waveforms in the time domain to produce speech. Speech generated this way largely met the demands of its time, but the method requires a large speech corpus, synthesis is slow, and there are audible gaps between the concatenated units. The statistical parametric method instead models the mathematical relationship between text and acoustic features, predicts the acoustic features required to generate speech, and finally converts those features into a time-domain waveform with a vocoder. This method places very high demands on vocoder quality; combined with errors in acoustic-feature prediction, the naturalness of the synthesized speech is limited and the machine-like sound is obvious. In recent years, the development of deep learning has brought new directions to speech synthesis, and its network models have demonstrated superior performance in many fields. Against this background, this thesis focuses on deep-learning-based prediction of the intermediate acoustic features required for speech synthesis.

A TTS system contains two main modules: a model that predicts acoustic features from text, and a vocoder that converts those acoustic features into speech. For the front-end task of predicting acoustic features from text, this thesis takes a sequence-to-sequence (Seq2Seq) deep-learning network as its basis and uses the lower-level Mel spectrogram as the acoustic feature. Compared with extracting complex phonetic features, Mel spectrograms are easier to obtain and more portable, being text-independent features, which reduces cost. The thesis also simplifies Tacotron, the end-to-end model that predicts Mel spectrograms directly: it removes the complex intermediate network structure and replaces it with a stack of a Convolutional Neural Network (CNN), an attention mechanism, and a Recurrent Neural Network (RNN). While simplifying the model structure, this stack still fuses acoustic information such as text, word, and prosody features, which enriches the detail of the synthesized speech.

For restoring the predicted Mel spectrogram to a time-domain speech waveform, this thesis uses the WaveNet model as the back-end vocoder. Because of its autoregressive generation process, the original WaveNet predicts slowly and cannot serve as a real-time synthesis tool, for which it has been widely criticized. The thesis therefore adopts parallel WaveNet, which is based on inverse autoregressive flow: it converts acoustic features into the corresponding time-domain waveform quickly enough to meet real-time requirements and improves the efficiency of model training and loading. Compared with the Griffin-Lim algorithm, whose output carries obvious traces of artificial synthesis, using WaveNet as the vocoder yields noticeably more natural speech.
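To make the central acoustic feature concrete, the following is a minimal, self-contained sketch of Mel spectrogram extraction: a windowed short-time Fourier transform followed by a triangular mel filterbank. It is illustrative only, not the exact feature pipeline of the thesis; the frame size, hop length, sample rate, and 80-band setting are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        # max(..., 1) guards against adjacent edges landing on the same FFT bin.
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav, sr=16000, n_fft=512, hop=128, n_mels=80):
    """Log-scale Mel spectrogram: Hann-windowed STFT -> power -> mel warp -> log."""
    window = np.hanning(n_fft)
    frames = [wav[s:s + n_fft] * window
              for s in range(0, len(wav) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.asarray(frames), axis=1)) ** 2
    return np.log(mel_filterbank(sr, n_fft, n_mels) @ power.T + 1e-10)
```

In a TTS pipeline of the kind described above, the front-end network is trained to predict a matrix like this one from text, and the vocoder learns the inverse mapping from the Mel spectrogram back to the waveform.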
Keywords/Search Tags:TTS, Deep Learning, Mel Spectrogram, Seq2Seq, WaveNet Vocoder