| With the development of computer and Internet technology, dialogue systems for human-machine interaction have undergone revolutionary changes. Correspondingly, speech synthesis technology converts information produced by a computer or input from outside, such as words, signals, or numbers, into continuous speech signals for output. This technology is also called Text To Speech (TTS). Early TTS systems often employed parameter-based synthesis methods, such as formant synthesis or LPC synthesis, but the resulting synthetic speech was not natural enough. Concatenative synthesis based on the PSOLA algorithm has been employed in recent years. Compared with traditional synthesis technology, the PSOLA approach obtains the marks of the speech signal by analyzing the text and then modifies the pitch, duration, and intensity of the concatenated units. Without changing the voice-quality details of the original speech units, this method can output synthesized speech with high intelligibility and naturalness by changing its prosodic characteristics. Based on the study of speech signal processing and Chinese prosody, this thesis investigates speech synthesis and its methods in depth. The main contributions are as follows: 1. Combining the advantages of FD-PSOLA and TD-PSOLA, a stepwise FD&TD PSOLA speech synthesis method is proposed. In this method, the pitch period and duration of the synthetic syllables are modified separately in the frequency domain and the time domain, according to target values produced by the Chinese prosody model. As a result, the modification of prosodic parameters can be controlled more effectively than with existing methods, without evident loss of naturalness or intelligibility in the synthesized speech. Chinese phrase synthesis experiments show that the method performs well in speech synthesis. 2. For Chinese speech, a large number of typical utterance pitch contours are extracted and analyzed.
Based on this, a data-driven prosody generation model for Chinese TTS is presented, which is characterized mainly by the pitch parameter, combined with duration and gain. It incorporates hierarchical Chinese prosodic information, such as sentence mood, phrase pace, tone, and accent. The level-control parameters are trainable and attributed to the components. A set of practical normalized multi-tone pattern functions and tone sandhi rules is provided. Simulation tests show that the model and rules effectively represent the relationship between the prosodic features and the multi-layer linguistic information. The prosody processing model is feasible and practical. |
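
To make the idea of pitch-synchronous overlap-add concrete, the following is a minimal sketch of plain TD-PSOLA only. It is not the stepwise FD&TD method proposed in the thesis, which modifies pitch and duration in separate domains; here both are handled in the time domain for brevity. The function names `hann` and `td_psola`, the parameters `pitch_scale` and `time_scale`, and the assumption that pitch marks are already available as sample indices are all illustrative choices, not taken from the thesis.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def td_psola(signal, marks, pitch_scale=1.0, time_scale=1.0):
    """Toy TD-PSOLA (illustrative, hypothetical interface).

    signal      -- list of float samples
    marks       -- sorted analysis pitch marks (sample indices), assumed given
    pitch_scale -- >1 raises pitch (synthesis marks placed closer together)
    time_scale  -- >1 lengthens the output

    Returns (output_samples, synthesis_marks).
    """
    if len(marks) < 2:
        raise ValueError("need at least two pitch marks")
    first, last = marks[0], marks[-1]
    out_len = int(len(signal) * time_scale) + 1
    out = [0.0] * out_len
    synth_marks = []
    t_s = float(first)                       # current synthesis time
    end = first + (last - first) * time_scale
    while t_s <= end:
        # Map the synthesis time back onto the analysis time axis.
        t_a = first + (t_s - first) / time_scale
        # Pick the analysis mark closest to that position.
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - t_a))
        # Local pitch period around the chosen mark.
        if i + 1 < len(marks):
            period = marks[i + 1] - marks[i]
        else:
            period = marks[i] - marks[i - 1]
        # Two-period Hann-windowed grain centred on the analysis mark.
        win = hann(2 * period + 1)
        centre = int(round(t_s))
        for k in range(-period, period + 1):
            src = marks[i] + k
            dst = centre + k
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                out[dst] += signal[src] * win[k + period]
        synth_marks.append(centre)
        # The spacing of synthesis marks sets the new pitch period.
        t_s += period / pitch_scale
    return out, synth_marks
```

With `pitch_scale=1.0` and `time_scale=1.0` the overlapping Hann grains sum to one between the first and last marks, so the input is reproduced there. Changing `time_scale` repeats or skips grains to alter duration, and changing `pitch_scale` alters the spacing between grains. A real system would also need pitch-mark detection and handling of unvoiced segments, which this sketch omits.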