With the development of internet technology, the popularization of 5G applications, and the rapid growth of social media, a large volume of multimedia video data is produced every day. Among its modalities, speech and vision are the two most intuitive. Effectively exploiting the emotional information contained in these two modalities is both a difficult and an important research problem, and emotion recognition based on them has become a major research direction. To address these problems, this thesis presents the following work:

(1) To address the insufficient and incomplete extraction of emotional cues in the speech modality, a network based on multi-head self-attention and the fusion of dynamic and static features is proposed for deep emotional feature extraction from speech. First, the speech data are preprocessed with spectral over-subtraction for speech enhancement and dual-threshold endpoint detection, which effectively removes invalid segments from the audio (a sketch of this step follows the abstract). Then, 3LMFCC dynamic features are extracted and modeled globally over time with an improved multi-head self-attention network, and the result is concatenated with static features extracted by openSMILE (see the sketches below). Finally, a softmax classifier predicts the emotional category of each speech sample. Experimental results show that this method effectively extracts and fuses the dynamic and static features of speech, significantly improving speech emotion recognition performance.

(2) To address the difficulty of extracting and fusing spatio-temporal features in the visual modality, a network combining C3D and multi-head self-attention is proposed for deep spatio-temporal feature extraction. First, key frames are selected using local average information entropy to ease the extraction problem (sketched below). Then, the C3D network extracts shallow spatio-temporal features from the selected key frames, and a multi-head self-attention network further extracts deep spatio-temporal features (see the visual-branch sketch below). Finally, a softmax classifier predicts the emotional category of each visual sample. Experimental results show that, compared with traditional methods, this approach effectively extracts visual spatio-temporal features and achieves better visual emotion recognition performance.

(3) To address the difficulty of effectively fusing the speech and visual modalities in video data, a bimodal emotion recognition method based on weighted decision fusion is proposed. First, the emotion recognition models designed for the speech and visual modalities each produce a unimodal emotion prediction vector, and the two vectors together form the emotion prediction matrix of the sample. Then, global and local weight coefficient matrices reassign the decision weight of each modality (a fusion sketch follows). Finally, the fused predicted probabilities of the emotion categories are compared, and the category with the highest probability is taken as the final prediction. Experimental results show that, compared with unimodal emotion recognition and feature-level fusion, this method fully exploits the correlated emotional information between the modalities and further improves recognition performance.
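A minimal sketch of the preprocessing step in (1), assuming a standard spectral over-subtraction scheme (noise estimated from the leading frames, over-subtraction factor alpha, spectral floor beta) and a dual-threshold rule on short-time energy; all frame sizes and thresholds are illustrative assumptions, not the thesis's settings, and the classical algorithm additionally thresholds the zero-crossing rate, omitted here for brevity.

```python
import numpy as np

def spectral_over_subtraction(x, frame=512, hop=256, alpha=2.0, beta=0.01):
    """Suppress stationary noise by over-subtracting an estimate of the
    noise magnitude spectrum, taken from the first few frames (assumption:
    they are noise-only)."""
    n_frames = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    spec = np.stack([np.fft.rfft(win * x[i*hop:i*hop+frame])
                     for i in range(n_frames)])
    mag, phase = np.abs(spec), np.angle(spec)
    noise = mag[:5].mean(axis=0)                            # noise estimate
    clean = np.maximum(mag - alpha * noise, beta * noise)   # over-subtract + floor
    frames = np.fft.irfft(clean * np.exp(1j * phase), n=frame)
    out = np.zeros(len(x))
    for i in range(n_frames):                               # overlap-add resynthesis
        out[i*hop:i*hop+frame] += win * frames[i]
    return out

def endpoint_detection(x, frame=512, hop=256, high=0.1, low=0.02):
    """Dual-threshold detection on short-time energy: frames above the high
    threshold seed a speech region, which is extended outward while the
    energy stays above the low threshold."""
    n_frames = 1 + (len(x) - frame) // hop
    energy = np.array([np.sum(x[i*hop:i*hop+frame] ** 2)
                       for i in range(n_frames)])
    energy /= energy.max() + 1e-12
    keep = np.zeros(n_frames, dtype=bool)
    i = 0
    while i < n_frames:
        if energy[i] > high:
            s, e = i, i
            while s > 0 and energy[s-1] > low:
                s -= 1
            while e < n_frames - 1 and energy[e+1] > low:
                e += 1
            keep[s:e+1] = True
            i = e + 1
        else:
            i += 1
    return keep        # boolean mask of speech frames; False = invalid segment
```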
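The abstract does not expand the term "3LMFCC"; the sketch below assumes the dynamic features are 13 MFCCs with their first- and second-order differences (39 dimensions per frame), a common choice, and uses librosa purely for illustration.

```python
import librosa
import numpy as np

def dynamic_features(path):
    """Frame-level dynamic features: MFCC + delta + delta-delta."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    d1 = librosa.feature.delta(mfcc)             # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)    # second-order difference
    return np.vstack([mfcc, d1, d2]).T           # (frames, 39)
```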
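A minimal PyTorch sketch of the speech branch in (1). It stands in a standard Transformer encoder for the thesis's "improved" multi-head self-attention network, mean-pools the frame sequence, and concatenates the result with an utterance-level static vector (e.g. an openSMILE functional set). All dimensions, head counts, and the class count are assumptions.

```python
import torch
import torch.nn as nn

class SpeechEmotionNet(nn.Module):
    def __init__(self, dyn_dim=39, static_dim=384, d_model=128,
                 n_heads=4, n_classes=7):
        super().__init__()
        self.proj = nn.Linear(dyn_dim, d_model)      # frame features -> model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model + static_dim, n_classes)

    def forward(self, dyn, static):
        # dyn: (batch, frames, dyn_dim) MFCC + delta + delta-delta sequence
        # static: (batch, static_dim) utterance-level statistics
        h = self.encoder(self.proj(dyn))             # global temporal modeling
        h = h.mean(dim=1)                            # pool over time
        fused = torch.cat([h, static], dim=-1)       # dynamic/static fusion
        return self.classifier(fused)                # softmax applied in the loss

logits = SpeechEmotionNet()(torch.randn(2, 300, 39), torch.randn(2, 384))
```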
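One plausible reading of the key-frame selection in (2): compute each frame's gray-level histogram entropy, average it over a sliding window, and keep the frame whose entropy is closest to that local average. The window size and the "closest to the mean" rule are assumptions; the thesis may use a different selection criterion.

```python
import numpy as np

def frame_entropy(gray):
    """Shannon entropy of an 8-bit grayscale frame's intensity histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def select_key_frames(frames, window=16):
    """frames: (N, H, W) uint8 array; returns indices of selected key frames."""
    ent = np.array([frame_entropy(f) for f in frames])
    keys = []
    for s in range(0, len(frames), window):
        seg = ent[s:s + window]
        local_mean = seg.mean()                       # local average entropy
        keys.append(s + int(np.argmin(np.abs(seg - local_mean))))
    return keys
```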
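A minimal PyTorch sketch of the visual branch in (2): a small C3D-style stack of 3-D convolutions extracts shallow spatio-temporal features from the key-frame clip, and a multi-head self-attention encoder over the resulting per-timestep tokens extracts deeper features. Channel sizes, clip length, and depth are assumptions, not the C3D configuration used in the thesis.

```python
import torch
import torch.nn as nn

class VisualEmotionNet(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_classes=7):
        super().__init__()
        self.c3d = nn.Sequential(                     # shallow C3D-style trunk
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
        )
        self.proj = nn.Linear(128, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, clip):
        # clip: (batch, 3, T, H, W) stack of selected key frames
        f = self.c3d(clip)                            # (batch, 128, T', H', W')
        f = f.mean(dim=(3, 4)).transpose(1, 2)        # spatial pool -> (batch, T', 128)
        h = self.encoder(self.proj(f)).mean(dim=1)    # deep spatio-temporal features
        return self.classifier(h)

logits = VisualEmotionNet()(torch.randn(2, 3, 16, 112, 112))
```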
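Finally, a sketch of the weighted decision fusion in (3): the two unimodal class-probability vectors are stacked into a 2 x C prediction matrix and combined with per-modality (global) and per-class (local) weights before the argmax. The weight values and the multiplicative way they are combined here are illustrative assumptions; in practice such weights would be estimated from each modality's validation performance.

```python
import numpy as np

def weighted_decision_fusion(p_speech, p_visual, global_w, local_w):
    """p_speech, p_visual: (C,) class probabilities from each unimodal model.
    global_w: (2,) per-modality reliability weights.
    local_w:  (2, C) per-modality, per-class weights."""
    P = np.stack([p_speech, p_visual])          # 2 x C emotion prediction matrix
    W = global_w[:, None] * local_w             # combined weight coefficient matrix
    fused = (W * P).sum(axis=0)                 # weighted sum over modalities
    return int(np.argmax(fused)), fused / fused.sum()

# Hypothetical 3-class example: speech is weighted slightly higher overall.
pred, probs = weighted_decision_fusion(
    np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.5, 0.3]),
    global_w=np.array([0.55, 0.45]),
    local_w=np.ones((2, 3)))
```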