
Multimodal Sentiment Analysis Based On Multichannel Convolutional Neural Network

Posted on: 2024-05-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Z G Wu | Full Text: PDF
GTID: 2568307136492874 | Subject: Electronic information
Abstract/Summary:
Sentiment analysis has long been a popular research topic in image, video, and natural language processing. In recent years, with the development of deep learning and the rapid growth of the Internet, numerous social media and e-commerce platforms have emerged. Platforms such as Weibo, Twitter, TikTok, forums, and Taobao have gradually become the main venues for people's daily entertainment and emotional expression, producing a large amount of information with distinctive personal emotional characteristics, including text, video, and audio. Sentiment analysis has therefore become particularly important for enhancing user experience, helping businesses improve service quality and personalized recommendations, and enabling government departments to monitor public opinion. Early researchers often processed emotional data with traditional, single-modal methods, resulting in poor recognition performance, low accuracy, and limited applicability to practical scenarios. With the development of deep learning and the improvement of neural networks, exploiting the complementary information that different modalities provide for a single modality is clearly a better approach, so more and more researchers have begun to use multimodal emotional information to improve sentiment analysis. In the context of deep learning-based cross-modal sentiment analysis, the research content and contributions of this thesis are as follows:

(1) An audio sentiment analysis method is proposed that combines the spectrogram with statistical features in a dual-channel end-to-end network (sketched below). The raw audio is converted to a spectrogram, and a Convolutional Neural Network (CNN) extracts emotion features while accounting for the influence of different convolutional kernels on feature extraction: the first convolutional layer is split, and kernels of different scales extract temporal and frequency-domain information from the spectrogram. In parallel, low-level descriptors (LLDs) are extracted from the raw audio signal, and statistical features (High-level Statistics Functions, HSFs) are computed from the LLDs. The two features are adaptively fused so that the model can autonomously select the parts that are more helpful for sentiment analysis, and classification is performed with fully connected and normalization layers to obtain the emotion polarity. Combining traditional statistical features with the spectrogram improves model robustness, and experimental results show good performance in both accuracy and F1 score.

(2) An image sentiment analysis method is proposed that combines facial encoding with context awareness (sketched below). A Multi-task Convolutional Neural Network (MTCNN) first crops the face region from the remaining background. Emotion features are extracted from the facial image with an improved 3D convolutional layer; the background image is also processed with 3D convolutional feature extraction, and a context-aware feature reflecting the attention influence is obtained through an attention mechanism. The two features are then fused and classified. Experimental results show that the method not only captures emotion from facial expressions but also considers body language during conversation, yielding a significant improvement in accuracy.

(3) A cross-modal BERT model based on visual, audio, and text fusion is proposed for sentiment analysis (sketched below). Building on single-modal sentiment analysis, the method adds the other modalities and fuses them. The image and audio inputs are first fed to their respective sub-networks for feature extraction and dimension control. A Masked Multimodal Attention (MMA) module then fuses the image and audio features through self-attention to obtain a bimodal attention matrix. Text is preprocessed with the BERT model to extract features, which are input to the MMA module, fused with the bimodal attention matrix, and normalized to obtain a multimodal attention weight matrix. Finally, the weight matrix is masked and combined with the initial text features to obtain the cross-modal emotion classification result. Experimental results show that the cross-modal fusion-based classification algorithm outperforms single-modal algorithms and performs better than other multimodal fusion models for emotion recognition on three public datasets.

The thesis proposes two different methods for audio and image sentiment analysis to cope with different scenarios and tests them on different modal datasets, achieving highest accuracies of 71.86% and 77.95%, respectively. Considering the limitations of single-modal sentiment analysis, a multimodal fusion sentiment analysis method is then proposed, and the experimental results show a highest accuracy of 85.2%.
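To make the dual-channel design in (1) concrete, here is a minimal PyTorch-style sketch. The kernel shapes, layer widths, the number of HSF inputs, and the gated form of the adaptive fusion are illustrative assumptions, not the thesis' exact configuration.

```python
# Hedged sketch of the dual-channel audio model described in (1).
import torch
import torch.nn as nn

class DualChannelAudioNet(nn.Module):
    def __init__(self, n_hsf: int = 88, n_classes: int = 3):  # sizes are assumptions
        super().__init__()
        # Spectrogram channel: the first layer is split into two kernel scales,
        # one elongated along time and one along frequency.
        self.time_conv = nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4))
        self.freq_conv = nn.Conv2d(1, 16, kernel_size=(9, 3), padding=(4, 1))
        self.spec_body = nn.Sequential(
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Statistical channel: HSFs computed from LLDs enter as a fixed-length vector.
        self.hsf_mlp = nn.Sequential(nn.Linear(n_hsf, 64), nn.ReLU())
        # Adaptive fusion: a learned gate decides how much of each channel to keep
        # (one plausible reading of "adaptively fused").
        self.gate = nn.Sequential(nn.Linear(128, 64), nn.Sigmoid())
        self.classifier = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, n_classes))

    def forward(self, spectrogram: torch.Tensor, hsf: torch.Tensor) -> torch.Tensor:
        s = torch.cat([self.time_conv(spectrogram), self.freq_conv(spectrogram)], dim=1)
        s = self.spec_body(s)                      # (B, 64) spectrogram feature
        h = self.hsf_mlp(hsf)                      # (B, 64) statistical feature
        g = self.gate(torch.cat([s, h], dim=1))    # (B, 64) fusion gate in [0, 1]
        fused = g * s + (1.0 - g) * h              # channel-wise adaptive mix
        return self.classifier(fused)              # emotion polarity logits
```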
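Likewise, a minimal sketch of the two-stream face/context model in (2). It assumes the face and background clips have already been cropped upstream (the abstract names MTCNN for this step); the 3D-convolution depths and the attention form are illustrative assumptions.

```python
# Hedged sketch of the face + context-awareness model described in (2).
import torch
import torch.nn as nn

def conv3d_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm3d(cout), nn.ReLU(),
        nn.MaxPool3d(kernel_size=(1, 2, 2)),
    )

class FaceContextNet(nn.Module):
    def __init__(self, n_classes: int = 7, dim: int = 64):  # sizes are assumptions
        super().__init__()
        self.face_stream = nn.Sequential(conv3d_block(3, 32), conv3d_block(32, dim),
                                         nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.ctx_stream = nn.Sequential(conv3d_block(3, 32), conv3d_block(32, dim))
        # Context awareness: the pooled face feature attends over the
        # spatio-temporal background tokens.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, face_clip: torch.Tensor, bg_clip: torch.Tensor) -> torch.Tensor:
        # face_clip, bg_clip: (B, 3, T, H, W), cropped by a face detector upstream
        f = self.face_stream(face_clip)                        # (B, dim)
        c = self.ctx_stream(bg_clip)                           # (B, dim, T', H', W')
        tokens = c.flatten(2).transpose(1, 2)                  # (B, N, dim)
        ctx, _ = self.attn(f.unsqueeze(1), tokens, tokens)     # (B, 1, dim)
        fused = torch.cat([f, ctx.squeeze(1)], dim=1)          # (B, 2*dim)
        return self.classifier(fused)
```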
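Finally, a simplified reading of the Masked Multimodal Attention fusion in (3). BERT text features are assumed to be computed upstream, and the audio/visual input widths (74 and 47) are placeholders typical of public multimodal benchmarks, not values confirmed by the abstract; the exact way the bimodal matrix is built, normalized, and masked is an assumption.

```python
# Hedged sketch of the MMA-style fusion described in (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultimodalAttention(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # "Dimension control": project audio/visual features to the text width.
        self.proj_a = nn.Linear(74, dim)   # placeholder audio feature width
        self.proj_v = nn.Linear(47, dim)   # placeholder visual feature width
        self.scale = dim ** -0.5
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor,
                visual: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # text: (B, L, dim) BERT features; audio/visual: (B, L, *) aligned features
        # pad_mask: (B, L) with 1 for real tokens, 0 for padding
        a, v = self.proj_a(audio), self.proj_v(visual)
        # Bimodal attention matrix from audio-visual attention scores.
        bimodal = torch.matmul(a, v.transpose(1, 2)) * self.scale       # (B, L, L)
        # Fuse with text-text scores and normalize into a weight matrix.
        textual = torch.matmul(text, text.transpose(1, 2)) * self.scale
        weights = bimodal + textual
        # Mask padded positions before softmax so they receive zero weight.
        weights = weights.masked_fill(pad_mask.unsqueeze(1) == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)                            # (B, L, L)
        # Combine with the initial text feature (residual) for classification.
        return self.norm(text + torch.matmul(weights, text))            # (B, L, dim)
```

A classification head would then typically be applied to the first (CLS) position of the returned sequence to produce the cross-modal sentiment prediction.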
Keywords/Search Tags: Cross-modal Sentiment Analysis, Attention Mechanism, Convolutional Networks, Adaptive Fusion, Pre-trained Model