
Multimodal Emotion Recognition Based On Feature Fusion

Posted on: 2023-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Xu
Full Text: PDF
GTID: 2568306833965089
Subject: Control engineering
Abstract/Summary:
In human-computer interaction, recognizing human emotion is a challenging problem and a key step toward barrier-free communication between humans and machines. In emotion recognition, an intelligent machine collects the target subject's verbal expressions, facial expressions, and behavioral actions through sensors, analyzes these data, and infers the subject's current emotional state. At present, most emotion recognition algorithms are built on single-modality social information; their results are one-sided and easily disturbed, and once separated from specific social settings their accuracy often falls short of practical requirements.

Against this background, multimodal fusion approaches to emotion recognition have attracted wide attention. They extract the heterogeneous data cues of the interaction process in parallel and judge the emotional state of social subjects by processing these cues jointly. Multimodal emotion recognition is, in principle, consistent with the basic dynamics of social interaction, but current studies still face several problems: the variety of heterogeneous data makes social cues harder to extract, capturing the intrinsic connections between different signals is difficult, and the methods and mechanisms for fusing data from different modalities remain unclear. In addition, most existing studies target emotion expressed in English, and there is relatively little work on Chinese expressions, which are harder to interpret. This thesis therefore takes Chinese multimodal emotion recognition as its main focus: it treats the feature information of three independent data streams, text, audio, and facial expression, as the research object, gives full consideration to contextual connections, and builds a multimodal emotion recognition model by capturing independent single-modality features and fusing information across modalities.

The main work and contributions of this thesis are as follows:

(1) A parallel convolution module and an attention-based bidirectional long short-term memory (BiLSTM) module are proposed. The parallel convolution module applies different convolution and pooling operations to fully extract feature information from the previous layer and fuses the extracted features, which also makes the network lighter to a certain extent. The attention-based BiLSTM module strengthens the extraction of key information while preserving the temporal order of the information (see the first sketch after this summary).

(2) For the audio and text emotion recognition models, the parallel convolution module and the attention-based BiLSTM module are added to lighten the network and preserve the temporal order of the information. For the facial expression recognition model, 3D and 2D convolutions are mixed so that expression features can be extracted from continuous video frames while unnecessary computation is reduced, and the attention-based BiLSTM module is appended at the end of the model to maintain the temporal order (see the second sketch below).

(3) To achieve sufficient and effective complementarity between modalities, the fusion points were selected through extensive experiments. Audio and text features are fused before the second input of the parallel convolution module, and the fused audio-text features are combined with facial expression features before the first input of the BiLSTM network (see the third sketch below).
(4) Targeting Chinese emotion recognition in real environments, the CH-SIMS dataset was used to verify the validity of the proposed emotion recognition models. The single-modality models reach recognition accuracies of 74.70% for audio, 77.13% for text, and 87.81% for facial expression, while the multimodal model reaches 93.92% at best, demonstrating the clear advantage of multimodal fusion for emotion recognition.
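The first sketch below illustrates, in PyTorch, the two modules proposed in contribution (1). The branch layout, kernel sizes, and hidden dimensions are illustrative assumptions, since the abstract does not specify them, and the class names ParallelConvModule and AttentiveBiLSTM are hypothetical.

```python
# Minimal sketch of the parallel convolution module and the attention-based
# BiLSTM module. All hyperparameters below are assumptions, not values
# taken from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelConvModule(nn.Module):
    """Runs several convolution/pooling branches in parallel over the
    previous layer's features and concatenates (fuses) their outputs."""

    def __init__(self, in_channels: int, branch_channels: int = 32):
        super().__init__()
        # Assumed branch design: two kernel sizes plus a pooled branch.
        self.branch3 = nn.Conv1d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv1d(in_channels, branch_channels, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(in_channels, branch_channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, 3 * branch_channels, time)
        return torch.cat(
            [F.relu(self.branch3(x)), F.relu(self.branch5(x)), F.relu(self.branch_pool(x))],
            dim=1,
        )


class AttentiveBiLSTM(nn.Module):
    """Bidirectional LSTM followed by additive attention pooling: key time
    steps are weighted more heavily while the recurrence itself preserves
    the temporal order of the information."""

    def __init__(self, input_size: int, hidden_size: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features)
        out, _ = self.bilstm(x)                         # (batch, time, 2*hidden)
        weights = torch.softmax(self.attn(out), dim=1)  # attention over time steps
        return (weights * out).sum(dim=1)               # (batch, 2*hidden)
```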
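The second sketch gives one plausible reading of the mixed 3D/2D convolution front end from contribution (2): a single 3D stage captures short-range motion across frames, after which cheaper 2D convolutions process each frame, and the per-frame features can then be fed to the attention-based BiLSTM. The channel counts and the exact point of the 3D-to-2D switch are assumptions.

```python
import torch
import torch.nn as nn


class Mixed3D2DNet(nn.Module):
    """Hybrid 3D/2D convolutional front end for expression clips. One 3D
    stage models motion across frames; the remaining per-frame work is done
    with cheaper 2D convolutions to reduce unnecessary computation."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # 3D stage over (channels, frames, H, W): captures motion cues.
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep all frames
        )
        # 2D stage applied frame by frame.
        self.conv2d = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, frames, H, W)
        x = self.conv3d(clip)                              # (B, 16, T, H/2, W/2)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.conv2d(x).flatten(1)                      # (B*T, 32)
        return self.proj(x).view(b, t, -1)                 # per-frame features for the BiLSTM


# Example: an 8-frame 64x64 RGB clip batch.
# feats = Mixed3D2DNet()(torch.randn(2, 3, 8, 64, 64))  # -> (2, 8, 128)
```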
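The third sketch shows the two fusion points from contribution (3), reusing the hypothetical ParallelConvModule and AttentiveBiLSTM classes from the first sketch. The abstract states only where fusion happens, not how; concatenation is assumed here, and the feature dimensions and number of classes are placeholders.

```python
import torch
import torch.nn as nn

# Assumes ParallelConvModule and AttentiveBiLSTM from the first sketch are in scope.


class FusionModel(nn.Module):
    """Audio and text features are fused before the second parallel
    convolution stage; the fused stream is combined with facial features
    before the BiLSTM, matching the two fusion points described above."""

    def __init__(self, audio_dim=40, text_dim=300, face_dim=128, num_classes=3):
        super().__init__()
        self.audio_conv = ParallelConvModule(audio_dim)    # first-stage, per modality
        self.text_conv = ParallelConvModule(text_dim)
        fused_ch = 3 * 32 + 3 * 32                         # concatenated branch outputs
        self.joint_conv = ParallelConvModule(fused_ch)     # "second input" fusion point
        self.bilstm = AttentiveBiLSTM(3 * 32 + face_dim)   # fused stream + face features
        self.classifier = nn.Linear(2 * 64, num_classes)

    def forward(self, audio, text, face):
        # audio: (B, audio_dim, T); text: (B, text_dim, T); face: (B, T, face_dim)
        a, t = self.audio_conv(audio), self.text_conv(text)
        at = self.joint_conv(torch.cat([a, t], dim=1))     # fusion point 1: audio + text
        at = at.transpose(1, 2)                            # (B, T, 96)
        fused = torch.cat([at, face], dim=2)               # fusion point 2: + facial features
        return self.classifier(self.bilstm(fused))


if __name__ == "__main__":
    model = FusionModel()
    audio = torch.randn(2, 40, 50)
    text = torch.randn(2, 300, 50)
    face = torch.randn(2, 50, 128)
    print(model(audio, text, face).shape)  # torch.Size([2, 3])
```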
Keywords/Search Tags: emotion recognition, parallel convolution module, 3D convolution, multimodal