The goal of speech synthesis is to generate artificial speech that is natural, fluent, and accurate, in order to meet the needs of speech interaction technology in various fields and scenarios. As a minority language, Tibetan also needs to be fully developed and applied in the information age. Developing a high-quality Tibetan speech synthesis system can raise the visibility and influence of Tibetan in education, cultural heritage, and other areas, promote multicultural exchange and understanding, and foster mutual understanding and integration among ethnic groups. Among the three major dialects of Tibetan, Amdo Tibetan is one of the most widely used, so this paper studies Amdo Tibetan speech synthesis technology. The specific work is as follows:

(1) Construction and preprocessing of a Tibetan corpus: This paper uses web crawlers to collect and organize a 4.8 MB Tibetan text corpus, which is preprocessed by normalizing the text and annotating it with International Phonetic Alphabet (IPA) symbols. A 4.57 GB Tibetan speech corpus was built through recording and preprocessed with high-frequency pre-emphasis, framing, and windowing, followed by extraction of spectrograms, Mel spectrograms, and Mel-frequency cepstral coefficients.

(2) Text-to-speech alignment of Tibetan text and speech: This paper studies HMM-GMM-based and CNN-CTC-based text-to-speech alignment methods for Tibetan. In the HMM-GMM-based alignment model, the speech features are normalized with cepstral mean-variance normalization and feature-space maximum-likelihood linear regression (fMLLR), and both monophones and triphones are used as modeling units, so that the context on either side of a phoneme yields more accurate alignments. In the CNN-CTC-based alignment model, convolutional neural networks transform the features through multiple layers to achieve translation invariance, and the speech and text are aligned using fully connected networks with connectionist temporal classification. Experiments showed that the CNN-CTC-based model achieved average absolute boundary-position differences of 12.1, 13.5, and 15.4 milliseconds for characters, syllables, and phonemes, respectively, so duration features were extracted with this model.

(3) Tibetan speech synthesis: This paper studies a non-autoregressive network structure for the synthesis model, comprising a FastSpeech 2-based acoustic model and a HiFi-GAN-based vocoder for Tibetan speech synthesis. The acoustic model uses the duration features for text-to-speech alignment and incorporates speech attributes such as fundamental frequency and energy, using a variance adaptor to adjust speaking rate and intonation for greater controllability. This design alleviates the one-to-many mapping problem and reduces word skipping and repetition. The vocoder uses the generator of a generative adversarial network to convert Mel spectrograms into high-quality waveform speech. The highest subjective evaluation score was 4.32, the real-time factor in the objective evaluation reached 4.85×10⁻³, and training was nearly three times faster than the baseline model.
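The following is a minimal sketch of the speech preprocessing pipeline described in (1): high-frequency pre-emphasis, framing and windowing (inside the STFT), and extraction of the spectrogram, Mel spectrogram, and Mel-frequency cepstral coefficients. The file name, sampling rate, and frame parameters are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import librosa

# Hypothetical recording from the Tibetan speech corpus.
wav, sr = librosa.load("amdo_utterance.wav", sr=16000)

# High-frequency pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
pre_emphasized = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])

# Framing and windowing are handled inside the STFT:
# 25 ms frames, 10 ms hop, Hann window (assumed values).
n_fft, hop = int(0.025 * sr), int(0.010 * sr)
spec = np.abs(librosa.stft(pre_emphasized, n_fft=n_fft, hop_length=hop,
                           window="hann"))                       # linear spectrogram
mel = librosa.feature.melspectrogram(S=spec**2, sr=sr, n_mels=80)  # Mel spectrogram
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr,
                            n_mfcc=13)                            # Mel cepstral coefficients
```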
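The HMM-GMM alignment step in (2) normalizes the cepstral features before modeling. Below is a minimal sketch of per-utterance cepstral mean-variance normalization (CMVN) under that assumption; the fMLLR transform is not shown.

```python
import numpy as np

def cmvn(features: np.ndarray) -> np.ndarray:
    """Normalize each cepstral dimension to zero mean and unit variance.

    features: array of shape (num_frames, num_coefficients), e.g. MFCCs.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8  # avoid division by zero
    return (features - mean) / std
```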
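The next sketch illustrates the CNN-CTC alignment model in (2): convolutional layers transform the Mel spectrogram, a fully connected layer maps each frame to phoneme posteriors, and a CTC loss ties the frame sequence to the phoneme sequence. Layer sizes, the phoneme inventory size, and the dummy shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CnnCtcAligner(nn.Module):
    def __init__(self, n_mels: int = 80, num_phonemes: int = 60):
        super().__init__()
        # Convolutional feature transform (translation-invariant over time and frequency).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fully connected projection to phoneme posteriors (+1 class for the CTC blank).
        self.fc = nn.Linear(32 * n_mels, num_phonemes + 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> (batch, 1, frames, n_mels)
        x = self.conv(mel.unsqueeze(1))
        batch, channels, frames, mels = x.shape
        x = x.permute(0, 2, 1, 3).reshape(batch, frames, channels * mels)
        return self.fc(x).log_softmax(dim=-1)  # (batch, frames, num_phonemes + 1)

# One training step with CTC loss (dummy data for illustration).
model = CnnCtcAligner()
ctc_loss = nn.CTCLoss(blank=0)
mel = torch.randn(2, 200, 80)                    # two utterances, 200 frames each
targets = torch.randint(1, 61, (2, 30))          # phoneme label sequences
log_probs = model(mel).transpose(0, 1)           # CTC expects (frames, batch, classes)
loss = ctc_loss(log_probs, targets,
                torch.full((2,), 200, dtype=torch.long),
                torch.full((2,), 30, dtype=torch.long))
loss.backward()
```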
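Finally, a minimal sketch of the duration-based length regulation used by FastSpeech 2-style acoustic models, as in (3): phoneme-level hidden states are expanded according to durations from the aligner so the decoder can generate the Mel spectrogram non-autoregressively. In the full model, the variance adaptor would additionally add predicted pitch and energy embeddings to these frames; the shapes and durations here are illustrative assumptions.

```python
import torch

def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme's hidden vector 'duration' times along the time axis.

    phoneme_hidden: (num_phonemes, hidden_dim)
    durations:      (num_phonemes,) integer frame counts from the aligner
    returns:        (total_frames, hidden_dim)
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 256)                  # four phoneme encodings, 256-dim
durations = torch.tensor([3, 5, 2, 6])        # frames per phoneme from the CNN-CTC aligner
frames = length_regulate(hidden, durations)   # (16, 256), ready for the Mel decoder
```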