Verbal information carries not only semantic content but also emotional content. Emotion analysis can help a human-computer interaction system capture the real purpose and latent intention of the speaker and respond appropriately, so Speech Emotion Recognition (SER) has been widely studied. At present, multimodality is a research hotspot in SER: the correlation between multiple modalities is exploited to improve system performance. This paper focuses on multimodal emotion recognition that fuses speech and text to improve the accuracy of the SER task. To address a shortcoming of current multimodal emotion recognition, namely that linear fusion cannot capture the interaction between modalities, two multimodal emotion feature fusion schemes are proposed. The main contributions of this paper are as follows.

First, to address the insufficient interactive fusion between the speech and text modalities, a multimodal emotion recognition scheme based on a Double Fusion Network (DFN) is proposed. The preprocessed speech and text feature vectors are first fused multiplicatively by a Factorized Bilinear Pooling (FBP) module. The fused feature vectors are then encoded by three sub-models: a Long Short-Term Memory (LSTM) network, a Gated Recurrent Unit (GRU) network, and a Deep Neural Network (DNN). The outputs of the three encoders are fused a second time by the Hadamard (element-wise) product, and the result is fed into a Bidirectional LSTM (BiLSTM) network to learn context-dependent emotional information. Finally, the extracted speech-text cross-fusion feature vector is passed to the classification output layer for emotion discrimination. The proposed DFN model is evaluated on the public IEMOCAP emotion dataset, reaching 80.38% weighted accuracy (WA) and 78.62% unweighted accuracy (UA), which verifies its effectiveness.

Second, building on the DFN model, a Multi-channel Parallel Fusion Network (MPFN) based on speech, text, and their cross features is proposed, with the aim of capturing the interactive information both between and within the speech and text modalities. The accuracy of the SER task is improved by fusing the encoded features of three different channels, yielding more accurate emotion prediction. The core of MPFN is to run a cross-fusion channel, a speech feature encoding channel, and a text feature encoding channel in parallel. A network composed of a Convolutional Neural Network (CNN), a BiLSTM, and Self-Attention (SA) extracts high-contribution speech emotion features from the Mel-spectrogram; a network composed of a BiLSTM and SA extracts high-contribution text emotion features from the word vectors produced by the GloVe model; and the DFN model extracts the cross-fusion features of the speech and text signals. Finally, the speech features, text features, and speech-text cross-fusion features obtained by fusion learning are used for emotion discrimination. The proposed MPFN model is evaluated on the public IEMOCAP emotion dataset, reaching 81.53% WA and 81.22% UA, which verifies its superiority.
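To make the two architectures concrete, the following PyTorch sketch illustrates one plausible reading of the DFN pipeline described above. It is not code from the thesis: the feature dimensions, the number of FBP factors, the time alignment of the two modalities, and the mean-pooling step before classification are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBPFusion(nn.Module):
    """Factorized Bilinear Pooling: project both modalities to a shared
    factor space, multiply elementwise, then sum-pool over k factors."""
    def __init__(self, speech_dim, text_dim, fusion_dim, k=4):
        super().__init__()
        self.k = k
        self.proj_s = nn.Linear(speech_dim, fusion_dim * k)
        self.proj_t = nn.Linear(text_dim, fusion_dim * k)

    def forward(self, speech, text):
        # speech: (B, T, speech_dim), text: (B, T, text_dim), assumed time-aligned
        joint = self.proj_s(speech) * self.proj_t(text)            # (B, T, d*k)
        joint = joint.view(*joint.shape[:2], -1, self.k).sum(-1)   # sum-pool factors -> (B, T, d)
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)  # power normalisation
        return F.normalize(joint, dim=-1)                          # L2 normalisation

class DFN(nn.Module):
    """Double Fusion Network sketch: FBP first fusion -> three parallel encoders
    (LSTM / GRU / DNN) -> Hadamard second fusion -> BiLSTM -> classifier."""
    def __init__(self, speech_dim=128, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        self.fbp = FBPFusion(speech_dim, text_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, speech, text):
        fused = self.fbp(speech, text)            # first fusion (FBP)
        h_lstm, _ = self.lstm(fused)
        h_gru, _ = self.gru(fused)
        h_dnn = self.dnn(fused)
        joint = h_lstm * h_gru * h_dnn            # second fusion (Hadamard product)
        ctx, _ = self.bilstm(joint)               # context-dependent emotional features
        return self.classifier(ctx.mean(dim=1))   # utterance-level logits

# Example: a batch of 2 utterances, 50 aligned frames each
logits = DFN()(torch.randn(2, 50, 128), torch.randn(2, 50, 300))
print(logits.shape)  # torch.Size([2, 4])
```

Likewise, a minimal sketch of the MPFN channel layout is given below, again under stated assumptions rather than as the authors' implementation: each channel is pooled with a simple additive self-attention, the channel vectors are concatenated before classification, and the cross-fusion channel is represented as an optional precomputed feature vector (for example, taken from the DFN sketch above).

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    """Additive self-attention that pools a sequence into a single vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention weights over time
        return (w * x).sum(dim=1)                # (B, dim)

class MPFN(nn.Module):
    """Multi-channel Parallel Fusion Network sketch: a speech channel
    (CNN -> BiLSTM -> self-attention), a text channel (BiLSTM -> self-attention),
    and an optional cross-fusion channel, concatenated and classified."""
    def __init__(self, n_mels=64, text_dim=300, hidden=128, cross_dim=0, num_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
                                 nn.ReLU())
        self.speech_rnn = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.speech_att = SelfAttentionPool(2 * hidden)
        self.text_rnn = nn.LSTM(text_dim, hidden, batch_first=True, bidirectional=True)
        self.text_att = SelfAttentionPool(2 * hidden)
        self.classifier = nn.Linear(4 * hidden + cross_dim, num_classes)

    def forward(self, mel, text_emb, cross_feat=None):
        # mel: (B, T, n_mels) Mel-spectrogram frames; text_emb: (B, L, text_dim) GloVe vectors
        s = self.cnn(mel.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        s, _ = self.speech_rnn(s)
        s = self.speech_att(s)                             # speech channel vector
        t, _ = self.text_rnn(text_emb)
        t = self.text_att(t)                               # text channel vector
        feats = [s, t]
        if cross_feat is not None:                         # cross-fusion channel vector
            feats.append(cross_feat)
        return self.classifier(torch.cat(feats, dim=-1))

# Example: speech and text channels only, without precomputed cross features
logits = MPFN()(torch.randn(2, 120, 64), torch.randn(2, 30, 300))
print(logits.shape)  # torch.Size([2, 4])
```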