| Emotional voice conversion is a special form of voice conversion that changes the emotional state expressed in a source speaker's speech to another emotional state while preserving the semantic content and the speaker's identity. The speech data used for emotional voice conversion are either parallel or non-parallel; the main difference is whether the source and target utterances are paired with the same semantic content. Compared with non-parallel data, parallel data require less training data and allow faster conversion, which makes them well suited to practical applications. However, parallel speech data must first be aligned with the Dynamic Time Warping (DTW) algorithm, and because emotional speech fluctuates irregularly, DTW tends to under-match or over-match it. In addition, existing work focuses mainly on converting the spectral features, which reflect the speaker's timbre, whereas the fundamental frequency (F0), which is also an important feature for expressing emotion in speech, receives little attention. To address these problems with practical application in mind, this paper studies the features and models used in emotional voice conversion and improves the DTW algorithm for emotional speech. The specific contributions are as follows: 1. To address the neglect of fundamental frequency conversion in traditional emotional voice conversion, this paper proposes a hybrid-model emotional voice conversion method based on fundamental frequency feature segments. The method takes the logarithm of the fundamental frequency features, constructs multidimensional logarithmic fundamental frequency feature segments, and trains an artificial neural network (ANN) to 
convert the fundamental frequency features of emotional speech, which improves the robustness of emotional voice conversion. A deep bidirectional long short-term memory network (DBiLSTM) is trained on Mel-frequency cepstral coefficient (MFCC) features derived from the spectral features; it learns the spectral mapping between source and target speech as well as the contextual relationships within the spectrum. Experimental results show that, by converting the spectral features and the fundamental frequency features separately with a hybrid model, the method improves the mean opinion score (MOS) of the converted emotional speech by 8% in subjective evaluation. For spectral feature conversion, the Mel-cepstral distortion (MCD) is 9% lower than that of other models. For fundamental frequency conversion, the multidimensional logarithmic fundamental frequency feature segments used in this paper reduce the F0 root-mean-square error (F0-RMSE) by 18% on average compared with conversion using the traditional Gaussian normalization function. 2. For parallel emotional speech data, this paper proposes an emotional speech alignment algorithm based on Shape DTW++ to address the over-alignment and insufficient local matching of the traditional DTW algorithm. The new algorithm first converts the DTW input sequences into descriptor sequences using the shape descriptors of the Shape DTW algorithm, which strengthens local matching. It then aligns the descriptor sequences with a DTW extended by adaptive cumulative-distance loss weights and relaxed endpoints, which reduces over-alignment of emotional speech data. Experiments show that the Shape DTW++ algorithm produces better-aligned emotional speech data than the DTW algorithm, reduces the loss that the input data introduce into model training, and thus improves the final emotional 
voice conversion quality. 3. Building on the work above, this paper designs and implements a visualization system for the hybrid-model emotional voice conversion method based on fundamental frequency feature segments. The system displays the emotional speech data and their features, the alignment results for parallel emotional speech data, and the emotional voice conversion results, and thereby demonstrates the proposed method in use. |
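
The alignment idea in contribution 2 can be illustrated with a minimal sketch. It is not the paper's Shape DTW++ implementation: it only shows the underlying step of replacing each frame with a shape descriptor and then running DTW on the descriptors. The descriptor used here (the raw window of neighbouring frames), the window `width`, and the function names `shape_descriptors`, `dtw_path` and `shape_dtw_align` are all assumptions made for illustration. The adaptive cumulative-distance loss weights and relaxed endpoints are not specified in the abstract and are therefore left out; plain DTW is used in their place.

```python
import numpy as np


def shape_descriptors(seq, width=5):
    """Stack each frame with its neighbours (raw-subsequence descriptor).

    `width` is assumed to be odd so that the window is centred on the frame.
    """
    seq = np.asarray(seq, dtype=float)
    if seq.ndim == 1:
        seq = seq[:, None]  # treat a 1-D contour as (frames, 1)
    half = width // 2
    # Repeat the first and last frame so every frame has a full window.
    padded = np.pad(seq, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + width].ravel() for i in range(len(seq))])


def dtw_path(x, y):
    """Classic DTW with Euclidean frame cost; returns (total cost, path)."""
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the last cell to recover the frame-to-frame alignment.
    i, j, path = n, m, [(n - 1, m - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return acc[n, m], path[::-1]


def shape_dtw_align(src, tgt, width=5):
    """Align two sequences by running DTW on their shape descriptors."""
    return dtw_path(shape_descriptors(src, width),
                    shape_descriptors(tgt, width))


# Toy contours standing in for source and target emotional speech features.
src = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
tgt = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0]
cost, path = shape_dtw_align(src, tgt)
print(path)  # list of (source frame, target frame) index pairs
```

Because each frame is compared through its local neighbourhood rather than its single value, frames with similar values but different local shapes are less likely to be matched, which is the local-matching benefit the abstract attributes to shape descriptors.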