Speech Emotion Recognition (SER) aims to identify human emotions from a given speech signal and plays a critical role in Human-Computer Interaction (HCI), helping to bridge the communication gap between humans and computers. Traditional SER systems are trained and tested on a single corpus. However, due to differences in recording equipment quality, spoken language, environmental noise, and speaker demographics, the feature distributions of different corpora differ. When training (source domain) and testing (target domain) are performed on different corpora, the performance of traditional SER methods degrades significantly. To overcome these problems, this thesis studies cross-corpus SER based on deep learning. The principal contributions of this thesis are summarized as follows:

(1) We propose an unsupervised domain adaptation method based on Transformers and a domain adversarial neural network. The method first extracts the IS09 and IS10 feature sets defined for the INTERSPEECH challenges, and then uses a Transformer encoder to learn contextual information from the extracted hand-crafted features, yielding time-series features for each utterance. To obtain domain-invariant features suitable for cross-corpus SER, a domain discriminator is trained to classify whether a speech sample comes from the source or the target domain. An adversarial objective encourages domain confusion, so that the network learns feature representations shared across domains.

(2) We propose an unsupervised feature-decomposition domain adaptation method based on Transformers and mutual information. The method uses a pre-trained deep audio model to extract audio features, and then builds a domain-invariant feature extractor from Transformer encoder layers. A Max-Min Mutual Information strategy is designed to learn domain-invariant features from the input deep features and to perform the final emotion classification. Finally, to minimize the impact of speaker variation on model performance, we design a speaker discriminator that forces the domain-invariant feature extractor to discard speaker information.

(3) Cross-corpus experiments on three public speech emotion databases (IEMOCAP, MSP-Improv, and CASIA) indicate that the proposed methods effectively improve cross-corpus SER performance and classification accuracy. The proposed methods are also compared with a baseline model and other state-of-the-art domain adaptation methods; the results show that they achieve the best performance, verifying their effectiveness for cross-corpus SER research.

In summary, this thesis focuses on deep-learning-based modeling and optimization of cross-corpus speech samples to improve cross-corpus SER performance. In future research, building models with stronger generalization capability and fusing additional modalities for cross-corpus SER are important research directions.
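As a concrete illustration of how the Transformer encoder in contribution (1) can turn a sequence of frame-level acoustic features into a single utterance-level representation, the following PyTorch sketch mean-pools the encoder outputs. The feature dimension (384, as in the IS09 functional set), model width, and layer counts are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Minimal sketch: contextualize frame-level features, pool to one embedding."""
    def __init__(self, feat_dim=384, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # map acoustic features to model width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):               # x: (batch, time, feat_dim)
        h = self.encoder(self.proj(x))  # contextualized frame representations
        return h.mean(dim=1)            # mean-pool to an utterance-level embedding
```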
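The domain adversarial training in contribution (1) is typically realized with a gradient reversal layer, as in DANN: the discriminator learns to separate source from target, while the reversed gradient pushes the feature extractor toward domain confusion. Below is a minimal sketch under that assumption; the layer sizes and the reversal weight lambda are placeholders.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    """Binary classifier: does an utterance embedding come from source or target?"""
    def __init__(self, d_model=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, z, lambd=1.0):
        # Gradients flowing back through this branch are negated, so the
        # feature extractor is trained to make the two domains indistinguishable.
        return self.net(GradReverse.apply(z, lambd))
```

The speaker discriminator in contribution (2) can reuse the same pattern with speaker identities as the classification targets, so that the reversed gradients strip speaker information from the domain-invariant features.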
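One plausible reading of the Max-Min Mutual Information strategy in contribution (2) is a MINE-style estimator trained in two alternating steps: a max step that tightens a lower bound on the mutual information between the learned embedding and a domain code, and a min step that updates the feature extractor to shrink that estimate. The sketch below is written under this assumption; the thesis may define the objective differently, and all dimensions are placeholders.

```python
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Statistics network T(z, d) estimating a lower bound on I(Z; D)."""
    def __init__(self, z_dim=256, d_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + d_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def mi_lower_bound(self, z, d):
        # d: one-hot source/target code, shape (batch, d_dim).
        joint = self.net(torch.cat([z, d], dim=1)).mean()
        perm = d[torch.randperm(d.size(0))]           # break the pairing -> marginals
        marg = self.net(torch.cat([z, perm], dim=1))
        # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)]
        return joint - (torch.logsumexp(marg, dim=0) - math.log(marg.size(0))).squeeze()

# Max step: update the estimator to tighten the bound.
#   mi = mine.mi_lower_bound(z.detach(), d); (-mi).backward(); opt_mine.step()
# Min step: update the feature extractor to reduce the estimated MI.
#   mi = mine.mi_lower_bound(z, d); (mi_weight * mi).backward(); opt_enc.step()
```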