Verbal information carries not only semantic content but also emotional content. Emotion analysis can help a human-computer interaction system capture the real purpose and latent intention of the speaker and respond appropriately, so Speech Emotion Recognition (SER) has been widely studied. At present, multimodality is a research hotspot in SER: the correlation between multiple modalities is exploited to improve system performance. This paper focuses on multimodal emotion recognition that fuses speech and text to improve the accuracy of the SER task. To address a shortcoming of current multimodal emotion recognition, namely that linear fusion cannot capture the interaction between modalities, two multimodal emotion feature fusion schemes are proposed. The main contributions of this paper are as follows.

First, to address the insufficient interactive fusion between the speech and text modalities, a multimodal emotion recognition scheme based on a Double Fusion Network (DFN) is proposed. The preprocessed speech and text feature vectors are first fused multiplicatively by a Factorized Bilinear Pooling (FBP) module. The fused feature vectors are then encoded by three sub-models: a Long Short-Term Memory (LSTM) network, a Gated Recurrent Unit (GRU) network, and a Deep Neural Network (DNN). The outputs of the three encoders are fused a second time by the Hadamard (element-wise) product, and the result is fed into a Bidirectional LSTM (BiLSTM) network to learn context-dependent emotional information. Finally, the extracted speech-text cross-fusion feature vector is passed to the classification output layer for emotion discrimination. The proposed DFN model is evaluated on the public IEMOCAP emotion dataset, reaching 80.38% weighted accuracy (WA) and 78.62% unweighted accuracy (UA), which verifies its effectiveness.

Second, building on the DFN model, a Multi-channel Parallel Fusion Network (MPFN) based on speech, text, and their cross features is proposed, with the aim of capturing the interactive information both between and within the speech and text modalities. The accuracy of the SER task is improved by fusing the encoded features of three different channels, yielding more accurate emotion prediction. The core of MPFN is to run a cross-fusion channel, a speech feature encoding channel, and a text feature encoding channel in parallel. A network composed of a Convolutional Neural Network (CNN), a BiLSTM, and Self-Attention (SA) extracts high-contribution speech emotion features from the Mel-spectrogram; a network composed of a BiLSTM and SA extracts high-contribution text emotion features from the word vectors produced by the GloVe model; and the DFN model extracts the cross-fusion features of the speech and text signals. Finally, the speech features, text features, and speech-text cross-fusion features obtained by fusion learning are used for emotion discrimination. The proposed MPFN model is evaluated on the public IEMOCAP emotion dataset, reaching 81.53% WA and 81.22% UA, which verifies its superiority.
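To make the two architectures concrete, the following PyTorch sketch illustrates one plausible reading of the DFN pipeline described above. It is not code from the thesis: the feature dimensions, the number of FBP factors, the time alignment of the two modalities, and the mean-pooling step before classification are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBPFusion(nn.Module):
    """Factorized Bilinear Pooling: project both modalities to a shared
    factor space, multiply elementwise, then sum-pool over k factors."""
    def __init__(self, speech_dim, text_dim, fusion_dim, k=4):
        super().__init__()
        self.k = k
        self.proj_s = nn.Linear(speech_dim, fusion_dim * k)
        self.proj_t = nn.Linear(text_dim, fusion_dim * k)

    def forward(self, speech, text):
        # speech: (B, T, speech_dim), text: (B, T, text_dim), assumed time-aligned
        joint = self.proj_s(speech) * self.proj_t(text)            # (B, T, d*k)
        joint = joint.view(*joint.shape[:2], -1, self.k).sum(-1)   # sum-pool factors -> (B, T, d)
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)  # power normalisation
        return F.normalize(joint, dim=-1)                          # L2 normalisation

class DFN(nn.Module):
    """Double Fusion Network sketch: FBP first fusion -> three parallel encoders
    (LSTM / GRU / DNN) -> Hadamard second fusion -> BiLSTM -> classifier."""
    def __init__(self, speech_dim=128, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        self.fbp = FBPFusion(speech_dim, text_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, speech, text):
        fused = self.fbp(speech, text)            # first fusion (FBP)
        h_lstm, _ = self.lstm(fused)
        h_gru, _ = self.gru(fused)
        h_dnn = self.dnn(fused)
        joint = h_lstm * h_gru * h_dnn            # second fusion (Hadamard product)
        ctx, _ = self.bilstm(joint)               # context-dependent emotional features
        return self.classifier(ctx.mean(dim=1))   # utterance-level logits

# Example: a batch of 2 utterances, 50 aligned frames each
logits = DFN()(torch.randn(2, 50, 128), torch.randn(2, 50, 300))
print(logits.shape)  # torch.Size([2, 4])
```

Likewise, a minimal sketch of the MPFN channel layout is given below, again under stated assumptions rather than as the authors' implementation: each channel is pooled with a simple additive self-attention, the channel vectors are concatenated before classification, and the cross-fusion channel is represented as an optional precomputed feature vector (for example, taken from the DFN sketch above).

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    """Additive self-attention that pools a sequence into a single vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention weights over time
        return (w * x).sum(dim=1)                # (B, dim)

class MPFN(nn.Module):
    """Multi-channel Parallel Fusion Network sketch: a speech channel
    (CNN -> BiLSTM -> self-attention), a text channel (BiLSTM -> self-attention),
    and an optional cross-fusion channel, concatenated and classified."""
    def __init__(self, n_mels=64, text_dim=300, hidden=128, cross_dim=0, num_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
                                 nn.ReLU())
        self.speech_rnn = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.speech_att = SelfAttentionPool(2 * hidden)
        self.text_rnn = nn.LSTM(text_dim, hidden, batch_first=True, bidirectional=True)
        self.text_att = SelfAttentionPool(2 * hidden)
        self.classifier = nn.Linear(4 * hidden + cross_dim, num_classes)

    def forward(self, mel, text_emb, cross_feat=None):
        # mel: (B, T, n_mels) Mel-spectrogram frames; text_emb: (B, L, text_dim) GloVe vectors
        s = self.cnn(mel.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        s, _ = self.speech_rnn(s)
        s = self.speech_att(s)                             # speech channel vector
        t, _ = self.text_rnn(text_emb)
        t = self.text_att(t)                               # text channel vector
        feats = [s, t]
        if cross_feat is not None:                         # cross-fusion channel vector
            feats.append(cross_feat)
        return self.classifier(torch.cat(feats, dim=-1))

# Example: speech and text channels only, without precomputed cross features
logits = MPFN()(torch.randn(2, 120, 64), torch.randn(2, 30, 300))
print(logits.shape)  # torch.Size([2, 4])
```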