Speech conversion can enrich speech style and is widely used in video game dubbing, webcasting, and other fields; it is an active area of speech research. Speech style includes speaker style, emotional style, and so on, and speech conversion refers to converting the source style of an utterance into a target style. In recent years, speech conversion has made considerable progress, but some problems remain unsolved. In voice conversion, most methods cannot fully separate source speaker information from content information, and the entanglement of content information with target speaker information is unsatisfactory; as a result, these methods show poor speaker similarity and cannot perform zero-shot conversion, i.e., conversion to a target speaker who is outside the training set. In emotion conversion, it is likewise difficult to separate the source emotion from the content information. To address these difficulties, this paper studies voice conversion and emotion conversion based on deep learning. The main contributions are as follows.

First, this paper proposes a zero-shot voice conversion method based on channel attention. On the one hand, channel attention with a channel-width constraint forms a learnable bottleneck in the content encoder, reducing coding redundancy and disentangling the content information from the source speaker information. On the other hand, a channel attention mechanism in the decoder maps the content features to the target domain, coupling the content information with the target speaker information. In addition, to improve the speaker embedding, an auxiliary speaker classifier is attached to the speaker encoder during training. Experiments show that speech generated by the proposed method surpasses the baseline models in both speaker similarity and naturalness.

Second, this paper proposes an emotion conversion method based on vector quantization. To disentangle content information from emotional information, the method discretizes the content representation over a limited set of symbols via vector quantization, removing redundant emotional information and achieving a better decoupling effect. Because such quantization can cause noticeable content loss, the method also designs a time-frequency random resampling module that preprocesses the data to generate parallel emotional corpora for supervised training. Experiments show that the proposed method achieves better emotion conversion and generates high-quality converted speech.

In summary, this paper proposes new methods for feature disentanglement and entanglement in the speech conversion task. The proposed methods enrich speech style, improve the naturalness of speech interaction, support the construction of large speech datasets, and provide a reference for the further development of speech research.
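The channel-attention bottleneck described for the content encoder can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy version, not the paper's implementation: it uses squeeze-and-excitation-style gating (global average pool, two-layer MLP, sigmoid gates), and realizes the channel-width constraint as a hard top-k mask over the gated channels. The function name, MLP shapes, and `keep_ratio` parameter are all hypothetical.

```python
import numpy as np

def channel_attention_bottleneck(features, w1, w2, keep_ratio=0.25):
    """Toy channel-attention bottleneck with a width constraint.

    features: (channels, time) content encoding
    w1: (hidden, channels) and w2: (channels, hidden) gating-MLP weights
    keep_ratio: fraction of channels allowed through the bottleneck
    """
    c, _ = features.shape
    squeeze = features.mean(axis=1)                 # (c,) global average pool
    hidden = np.maximum(0.0, w1 @ squeeze)          # ReLU hidden layer
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gates in (0, 1)
    # Width constraint: keep only the top-k gated channels, zero the rest,
    # forcing the encoder to drop redundant (e.g. speaker-related) channels.
    k = max(1, int(round(keep_ratio * c)))
    top = np.argsort(gates)[-k:]
    mask = np.zeros(c)
    mask[top] = 1.0
    return features * (gates * mask)[:, None]
```

In a trained model the gates would be learned end-to-end; here the top-k mask simply shows how a width constraint turns soft channel attention into a hard information bottleneck.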
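The core of the vector-quantization step — discretizing continuous content frames through a limited set of symbols — can be sketched as a nearest-neighbor codebook lookup. This is a generic VQ sketch under the assumption of an already-trained codebook; the paper's actual training objective and codebook size are not specified here.

```python
import numpy as np

def vector_quantize(content, codebook):
    """Map each content frame to its nearest codebook vector (L2 distance).

    content:  (time, dim) continuous content features
    codebook: (num_codes, dim) learned discrete symbols (assumed trained)
    Returns the quantized frames and the chosen code indices.
    """
    # Pairwise squared distances between every frame and every code.
    d = ((content[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```

Because every frame is snapped to one of finitely many codes, fine-grained variation such as residual emotional coloring cannot survive quantization, which is the decoupling effect the abstract describes.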
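The random resampling module used to build parallel corpora can be hinted at with a time-axis-only sketch (the frequency axis would be treated analogously): successive spectrogram segments are linearly re-stretched by random rates, perturbing prosody while preserving content. The segment length and rate range below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def random_time_resample(spec, rng, min_rate=0.8, max_rate=1.2, seg_len=32):
    """Randomly re-stretch successive time segments of a spectrogram.

    spec: (freq, time) magnitude spectrogram
    rng:  numpy Generator supplying the per-segment rates
    Each seg_len-frame segment is linearly resampled by a rate drawn
    uniformly from [min_rate, max_rate].
    """
    out = []
    for start in range(0, spec.shape[1], seg_len):
        seg = spec[:, start:start + seg_len]
        rate = rng.uniform(min_rate, max_rate)
        new_len = max(1, int(round(seg.shape[1] * rate)))
        # Sample positions for the stretched segment, then interpolate
        # each frequency row independently.
        src = np.linspace(0, seg.shape[1] - 1, new_len)
        out.append(np.stack([np.interp(src, np.arange(seg.shape[1]), row)
                             for row in seg]))
    return np.concatenate(out, axis=1)
```

Applying this to a source utterance yields a duration-perturbed copy with the same content, which can serve as a pseudo-parallel training pair.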