Multimodal Emotion Recognition (MER) has become a challenging and active research topic in affective computing. MER aims to recognize human emotional states from multiple modalities related to emotion expression, such as audio and text. In recent years, with the rapid development of machine learning, deep learning techniques have been applied to emotion recognition tasks and have achieved great progress. To this end, this paper adopts deep learning techniques and presents an in-depth study of MER combining the audio and text modalities. A fundamental challenge of MER is how to effectively fuse multimodal information. Many existing works rely on traditional methods such as feature-level fusion and decision-level fusion. However, these traditional fusion methods do not take into account cross-modal interaction information across different modalities. To address this problem, this paper proposes a new deep-learning-based MER method combining audio and text. The main contributions are summarized as follows:

(1) For the audio emotion recognition task, an audio emotion recognition framework based on BiLSTM with a Multi-head Self-attention mechanism (Audio BiLSTM Multi-head Self-attention, A-BLMHA) is developed. First, hand-crafted audio features such as Mel-Frequency Cepstral Coefficients (MFCCs) are extracted. Then, a BiLSTM model learns deep audio features from the extracted MFCCs to capture context-dependent temporal features in the audio samples. Finally, a Multi-head Self-attention mechanism aggregates the hidden-state information of the deep audio features to learn more discriminative audio features for the final audio emotion classification task. Experimental results on the IEMOCAP dataset show that the designed audio emotion recognition model achieves better results than comparable models.

(2) For the text emotion recognition task, a text emotion recognition framework based on the BERT (Bidirectional Encoder Representations from Transformers) language model and the Multi-head Self-attention mechanism (Text BERT Multi-head Self-attention, T-BMHA) is presented. The pre-trained BERT model is first fine-tuned to extract 768-dimensional text emotion features, which contain context-dependent temporal information. Then, the Multi-head Self-attention mechanism is introduced to focus on the salient emotion features of the text. Experimental results on the IEMOCAP dataset reveal that the designed text emotion recognition model achieves better results than other models.

(3) For the MER task combining audio and text, this paper constructs an AT-FLF (Audio-Text Feature-Level Fusion) framework, an AT-DLF (Audio-Text Decision-Level Fusion) framework, and an AT-MLAN (Audio-Text Multi-Level Attention Network) framework. In particular, the AT-MLAN framework consists of three main modules: a feature extraction module, a cross-modal attention module, and an emotion classification module. Experimental results on the IEMOCAP dataset indicate that the AT-MLAN framework outperforms the AT-FLF and AT-DLF frameworks, demonstrating the effectiveness of the proposed model. Compared with existing multimodal information fusion methods, the AT-MLAN framework also performs best.

In summary, this paper separately models and optimizes the audio modality, the text modality, and their multimodal combination for MER based on deep learning.
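To make the A-BLMHA pipeline concrete, the following is a minimal PyTorch sketch of a BiLSTM-plus-multi-head-self-attention audio branch. The layer sizes, number of heads, mean pooling, and four-class output are illustrative assumptions, not the configuration reported above.

```python
# Sketch of an A-BLMHA-style audio branch (all hyperparameters are assumptions).
import torch
import torch.nn as nn

class AudioBiLSTMSelfAttention(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, heads=4, n_classes=4):
        super().__init__()
        # BiLSTM captures context-dependent temporal features from MFCC frames.
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention aggregates the BiLSTM hidden states.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mfcc):                 # mfcc: (batch, frames, n_mfcc)
        h, _ = self.bilstm(mfcc)             # (batch, frames, 2 * hidden)
        a, _ = self.attn(h, h, h)            # self-attention over time steps
        pooled = a.mean(dim=1)               # average-pool the attended frames
        return self.classifier(pooled)       # emotion logits

# Example: 8 utterances, 300 MFCC frames each, 40 coefficients per frame
# (in practice the MFCCs would come from a front end such as librosa).
logits = AudioBiLSTMSelfAttention()(torch.randn(8, 300, 40))
print(logits.shape)                          # torch.Size([8, 4])
```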
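Similarly, the cross-modal attention module at the heart of AT-MLAN can be sketched as a symmetric attention in which audio features attend to text features and vice versa before fusion. This is an assumption-laden illustration only: the projection sizes, head count, single attention level, and concatenation-based fusion shown here are not the paper's exact multi-level design.

```python
# Sketch of a cross-modal attention fusion block in the spirit of AT-MLAN
# (dimensions, head count, and fusion strategy are illustrative assumptions).
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, d_audio=256, d_text=768, d_model=256, heads=4, n_classes=4):
        super().__init__()
        # Project both modalities into a shared dimension.
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        # Audio queries attend over text keys/values, and vice versa.
        self.a2t = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio_seq, text_seq):
        # audio_seq: (batch, T_a, d_audio); text_seq: (batch, T_t, d_text)
        a = self.proj_a(audio_seq)
        t = self.proj_t(text_seq)
        a_enriched, _ = self.a2t(a, t, t)    # audio attends to text
        t_enriched, _ = self.t2a(t, a, a)    # text attends to audio
        fused = torch.cat([a_enriched.mean(dim=1),
                           t_enriched.mean(dim=1)], dim=-1)
        return self.classifier(fused)        # emotion logits

# Random sequences stand in for BiLSTM audio states and BERT token embeddings.
logits = CrossModalAttentionFusion()(torch.randn(8, 300, 256),
                                     torch.randn(8, 40, 768))
print(logits.shape)                          # torch.Size([8, 4])
```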
Through extensive experiments, including comparative experiments, ablation experiments, and result visualization, the proposed model-level fusion framework based on a multi-level attention network is shown to be effective and superior to other recent multimodal information fusion methods. In future work, additional modalities such as video and physiological signals can be incorporated for more in-depth MER research.