Voice conversion is an intelligent speech technology that converts the personality characteristics of a source speaker into those of a target speaker while keeping the linguistic information unchanged. Thanks to the rapid development of artificial intelligence, voice conversion technology continues to break through bottlenecks and is gradually being deployed in commercial scenarios, bringing many conveniences to people's lives; nevertheless, several problems remain. First, traditional voice conversion methods focus on the personality characteristics of speakers and pay little attention to emotion conversion. Second, models require large amounts of training data to achieve good performance. Third, most emotional voice conversion methods concentrate on converting spectral features and convert the fundamental frequency only with a log-Gaussian normalization function.

To address these problems, this paper conducts a series of studies and improvements on emotional voice conversion based on the StarGAN model. First, we propose an emotional voice conversion method based on the StyleGAN-EVC model. An emotion style encoder extracts emotion style features from speech; compared with the one-hot vector used in the StarGAN model, these features express richer emotional information. Through adaptive instance normalization (a minimal sketch of this operation is given at the end of this abstract), the extracted emotion style features are fully fused with the semantic features extracted by the encoder network of the generator, thereby realizing emotion conversion. In addition, during joint optimization the emotion style encoder is constrained by a cycle consistency loss and an emotion style reconstruction loss, so that it can effectively extract emotion style features and the semantic features can adaptively match the emotion style features through adaptive instance normalization. Extending emotional voice conversion from the closed-set setting to the open-set setting, with no label requirement on the training data, is a key step towards practical application.

Extensive objective and subjective experiments show that, in the closed-set case, the speech converted by the proposed StyleGAN-EVC model achieves, compared with the StarGAN baseline, an average decrease of 15.23% in MCD, an 8.68% decrease in RMSE, a 36.76% increase in MOS, and a 12.50% increase in emotion classification rate, verifying that the proposed StyleGAN-EVC model improves not only the quality but also the emotional saturation of the converted speech. Compared with the closed-set case, the performance of StyleGAN-EVC in the open-set case decreases only slightly: the MCD of the converted speech increases by 0.95% on average, the RMSE increases by 0.35%, the MOS decreases by 0.81%, and the emotion classification rate decreases by 1.88%, verifying that the proposed StyleGAN-EVC can realize emotional voice conversion in the open set without compromising the quality or emotional saturation of the converted speech.

Furthermore, to enhance the emotional saturation of the converted speech, this paper proposes an emotional voice conversion method based on StyleGAN-EVC with fundamental frequency difference compensation. In most current emotional voice conversion methods, the fundamental frequency features are converted only by a log-Gaussian normalization function; however, the converted contour shows an overall upward shift, and the mean and standard deviation alone cannot accurately describe the amplitude difference between two emotions.
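For reference, the log-Gaussian normalization function mentioned above has the standard form sketched below: it matches the mean and standard deviation of log-F0 between the source and target emotions. The function and variable names are ours, and the per-emotion statistics are assumed to be estimated from training data; because the transformation only shifts and rescales the whole contour, it cannot capture frame-level amplitude differences between emotions.

```python
import numpy as np

def convert_f0_log_gaussian(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Log-Gaussian normalized F0 transformation: match the log-F0 mean and
    standard deviation of the source emotion to those of the target emotion.
    Unvoiced frames (F0 == 0) are left unchanged."""
    f0_out = f0_src.copy()
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return f0_out
```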
In view of these limitations, this paper proposes a fundamental frequency difference compensation vector: by adjusting the fundamental frequency features converted by the log-Gaussian normalization function, it compensates for and expands the amplitude difference between two emotions, thereby improving the emotional saturation of the converted speech. Extensive objective and subjective experiments show that, compared with the StyleGAN-EVC baseline, the proposed StyleGAN-EVC model with fundamental frequency difference compensation leaves the MCD of the converted speech unchanged, decreases the RMSE by 9.21%, increases the MOS by 2.44%, and increases the emotion classification rate by 5.00%, verifying the effectiveness of the proposed fundamental frequency difference compensation vector in improving the emotional saturation of the converted speech. In summary, by using the emotion style encoder and the fundamental frequency difference compensation vector, this paper significantly improves both the sound quality and the emotional saturation of the converted speech, and realizes high-quality emotional voice conversion in the open set without compromising either.
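For concreteness, the adaptive instance normalization operation referred to earlier, which fuses the emotion style features with the semantic features inside the generator, is sketched below in its standard form. This is a minimal PyTorch illustration only; the feature shapes, layer sizes, and the affine mapping from the emotion style feature are our assumptions, not the exact architecture of the thesis.

```python
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize the semantic (content)
    features per channel, then re-scale and re-shift them with parameters
    predicted from the emotion style feature."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm1d(num_channels, affine=False)
        # Affine layer mapping the style feature to per-channel scale and bias.
        self.affine = nn.Linear(style_dim, num_channels * 2)

    def forward(self, content, style):
        # content: (batch, channels, frames) semantic features from the
        # generator's encoder; style: (batch, style_dim) emotion style feature.
        gamma, beta = self.affine(style).chunk(2, dim=1)
        normalized = self.norm(content)
        return gamma.unsqueeze(-1) * normalized + beta.unsqueeze(-1)
```

Because the scale and shift applied to the normalized semantic features are predicted from the emotion style feature, the same linguistic content can be rendered with different emotional styles simply by changing the style input.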