
Multimodal Emotion Recognition Based On Multi-Scale Feature Fusion

Posted on: 2024-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z D Zhao
Full Text: PDF
GTID: 2568306941992099
Subject: Computer Science and Technology
Abstract/Summary:
As research on deep learning continues to progress, artificial intelligence technology is gradually empowering a wide range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state conveyed in spoken interaction has become a new research hotspot. Speech sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from poor adaptation across languages and cultures and from low recognition accuracy. There are two main reasons for this: (1) the loss of speech signal resolution in the feature extraction stage makes temporal analysis difficult; (2) multimodal recognition methods have difficulty learning the correlations among emotion-related regions shared across modalities. This thesis therefore designs a speech emotion recognition method based on multi-scale feature fusion and multimodal feature alignment. Specifically:

(1) A recognition model based on a multi-scale feature pyramid network (MSFPN) is proposed to address the loss of temporal dynamic resolution in cascaded deep feature extraction modules. The model first extracts multi-level, multi-scale features for speech emotion recognition, uses a forward fusion mechanism to fuse multi-scale features within the same layer and a backward fusion mechanism to fuse features across layers and recover the temporal dynamic resolution, and then uses a bidirectional long short-term memory network (BLSTM) to learn temporal dynamics and obtain an utterance-level integrated emotion representation.

(2) To address the limitations of traditional methods in learning multimodal co-occurrence, a multimodal feature alignment interaction network (MAIN) based on a weight-shared gated recurrent unit (WS-GRU) and a temporal correlation attention mechanism (SMA) is proposed. First, word-level alignment of speech and text features is performed with an attention-based feature alignment method (see the illustrative sketch below); then the WS-GRU learns word-vector weights to highlight emotion-related regions across modalities and complete utterance-level feature alignment; finally, speaker features are introduced and the SMA learns contextual emotion commonality across the multimodal information to further improve the accuracy of emotional state recognition in context.

The proposed models, MSFPN (a multi-scale feature pyramid-based speech emotion recognition model) and MAIN (a multimodal feature alignment-based emotion recognition model), improve the performance of speech emotion recognition systems and their ability to capture fine-grained emotional features. Experiments show that, compared with recent state-of-the-art methods, the unweighted accuracy (UA) of the MSFPN model improves by 0.8% and 1.94% on the IEMOCAP and EMO-DB corpora respectively, while the UA of the MAIN model improves by 0.4%, 1.86% and 2.80% on IEMOCAP under the LOSO, LOPO and RA data segmentation settings respectively. Thanks to the learning of temporal dynamics and multimodal commonality, the MSFPN and MAIN models achieve better recognition performance on the speech emotion recognition and multimodal emotion recognition tasks, respectively, and learn more discriminative emotion features.
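The following is a minimal PyTorch sketch of attention-based word-level alignment between speech and text features, in the spirit of the alignment step described in (2). It is not the thesis's actual MAIN implementation; the function name, tensor shapes, and feature dimensions are illustrative assumptions.

import torch
import torch.nn.functional as F

def align_speech_to_text(speech_feats, text_feats):
    """Pool frame-level speech features onto the word-level text time axis.

    speech_feats: (B, T_s, D) frame-level acoustic features (assumed shape)
    text_feats:   (B, T_t, D) word-level text embeddings (assumed shape)
    Returns speech features aligned to the word axis: (B, T_t, D).
    """
    d = text_feats.size(-1)
    # Each word embedding queries the speech frames it is most correlated with
    # (scaled dot-product cross-attention).
    scores = torch.matmul(text_feats, speech_feats.transpose(1, 2)) / d ** 0.5  # (B, T_t, T_s)
    weights = F.softmax(scores, dim=-1)
    aligned_speech = torch.matmul(weights, speech_feats)                        # (B, T_t, D)
    return aligned_speech

# Example: 2 utterances, 120 speech frames, 20 words, 256-dim features.
speech = torch.randn(2, 120, 256)
text = torch.randn(2, 20, 256)
print(align_speech_to_text(speech, text).shape)  # torch.Size([2, 20, 256])

In such a scheme, the word-aligned speech features can then be concatenated with the text embeddings and fed to a recurrent layer (e.g., the WS-GRU described above) for utterance-level fusion.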
Keywords/Search Tags:Speech Emotion Recognition, Multi-scale Feature Fusion, Multimodal Feature Alignment, Mutual Cross-attention