
Multimodal Emotion Recognition Based On Multi-Scale Feature Fusion

Posted on: 2024-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z D Zhao
Full Text: PDF
GTID: 2568306941992099
Subject: Computer Science and Technology
Abstract/Summary:
As research on deep learning continues to progress, artificial intelligence technology is gradually empowering a wide range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state conveyed in spoken interaction has become a new research hotspot. Speech sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from poor adaptation across languages and cultures and from low recognition accuracy. There are two main reasons for this: (1) the loss of speech signal resolution in the feature extraction stage makes temporal analysis difficult; (2) multimodal recognition methods have difficulty learning the correlations among emotion-related regions shared across modalities. This thesis therefore designs a speech emotion recognition method based on multi-scale feature fusion and multimodal feature alignment. Specifically:

(1) A recognition model based on a multi-scale feature pyramid network (MSFPN) is proposed to address the loss of temporal dynamic resolution in cascaded deep feature extraction modules. The model first extracts multi-level, multi-scale features for speech emotion recognition, uses a forward fusion mechanism to fuse multi-scale features within the same layer and a backward fusion mechanism to fuse features across layers and recover the temporal dynamic resolution, and then uses a bidirectional long short-term memory network (BLSTM) to learn temporal dynamics and obtain an utterance-level integrated emotion representation.

(2) To address the limitations of traditional methods in learning multimodal co-occurrence, a multimodal feature alignment interaction network (MAIN) based on a weight-shared gated recurrent unit (WS-GRU) and a temporal correlation attention mechanism (SMA) is proposed. First, word-level alignment of speech and text features is performed with an attention-based feature alignment method (see the illustrative sketch below); then the WS-GRU learns word-vector weights to highlight emotion-related regions across modalities and complete utterance-level feature alignment; finally, speaker features are introduced and the SMA learns contextual emotion commonality across the multimodal information to further improve the accuracy of emotional state recognition in context.

The proposed models, MSFPN (a multi-scale feature pyramid-based speech emotion recognition model) and MAIN (a multimodal feature alignment-based emotion recognition model), improve the performance of speech emotion recognition systems and their ability to capture fine-grained emotional features. Experiments show that, compared with recent state-of-the-art methods, the unweighted accuracy (UA) of the MSFPN model improves by 0.8% and 1.94% on the IEMOCAP and EMO-DB corpora respectively, while the UA of the MAIN model improves by 0.4%, 1.86% and 2.80% on IEMOCAP under the LOSO, LOPO and RA data segmentation settings respectively. Thanks to the learning of temporal dynamics and multimodal commonality, the MSFPN and MAIN models achieve better recognition performance on the speech emotion recognition and multimodal emotion recognition tasks, respectively, and learn more discriminative emotion features.
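The following is a minimal PyTorch sketch of attention-based word-level alignment between speech and text features, in the spirit of the alignment step described in (2). It is not the thesis's actual MAIN implementation; the function name, tensor shapes, and feature dimensions are illustrative assumptions.

import torch
import torch.nn.functional as F

def align_speech_to_text(speech_feats, text_feats):
    """Pool frame-level speech features onto the word-level text time axis.

    speech_feats: (B, T_s, D) frame-level acoustic features (assumed shape)
    text_feats:   (B, T_t, D) word-level text embeddings (assumed shape)
    Returns speech features aligned to the word axis: (B, T_t, D).
    """
    d = text_feats.size(-1)
    # Each word embedding queries the speech frames it is most correlated with
    # (scaled dot-product cross-attention).
    scores = torch.matmul(text_feats, speech_feats.transpose(1, 2)) / d ** 0.5  # (B, T_t, T_s)
    weights = F.softmax(scores, dim=-1)
    aligned_speech = torch.matmul(weights, speech_feats)                        # (B, T_t, D)
    return aligned_speech

# Example: 2 utterances, 120 speech frames, 20 words, 256-dim features.
speech = torch.randn(2, 120, 256)
text = torch.randn(2, 20, 256)
print(align_speech_to_text(speech, text).shape)  # torch.Size([2, 20, 256])

In such a scheme, the word-aligned speech features can then be concatenated with the text embeddings and fed to a recurrent layer (e.g., the WS-GRU described above) for utterance-level fusion.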
Keywords/Search Tags:Speech Emotion Recognition, Multi-scale Feature Fusion, Multimodal Feature Alignment, Mutual Cross-attention