| Sentiment analysis is an important research task in natural language processing, and multimodal sentiment analysis is a particularly important direction within it. With the explosive growth of short-video social media platforms such as Jitterbug and Beeper, people have begun to express their emotions in forms other than text. Whereas the original sentiment analysis task focuses on sentiment in text, multimodal sentiment analysis collects sentiment features from multiple information sources and combines information from different domains for analysis. For example, by extracting and learning intonation, timbre, and other features of speech, we can obtain richer emotional information than by analyzing text alone. Multimodality has been widely used in sentiment analysis tasks, especially speech sentiment analysis. Compared with most single-modal emotion expressions, multimodal expression conveys human emotion more intuitively because multimodal information contains more and richer emotional features. In recent years, deep learning has evolved rapidly and become widely used in sentiment analysis tasks. Most researchers combine different deep learning models built on various neural networks for feature analysis. Although deep learning models extract high-level sentiment features better, their computational cost and training time have also increased. Most current research deals mainly with the extraction of speech features, but the accuracy and prediction rate of the models still need to be improved. In this paper, we address the problem of model accuracy and prediction rate and conduct an in-depth study of multimodal sentiment analysis. The main research work of this paper is as follows. (1) To address the problem of extracting and fusing speech emotion feature information, we propose a model that uses a hierarchical Conformer network and fuses GRU and attention mechanisms to improve the accuracy of emotion analysis. The
method consists of two main parts: a local feature learning group and a global feature learning group. The local feature learning group uses the Conformer model, which combines convolution and Transformer networks, to learn speech emotion features in time and space; this strengthens the extraction of both long-term and short-term feature information. Global features are then extracted by an AUGRU model, and an attention mechanism performs feature fusion by assigning weights to the feature information. Finally, emotions are identified through a fully connected layer and classified using the center loss function and the softmax function. We validate the proposed model on the IEMOCAP and RAVDESS benchmark datasets, and the prediction results show that learning features in groups at different levels captures sentiment information more accurately and brings some improvement to multimodal sentiment analysis. (2) To address the problems of impurities in, and cross-modal interaction between, visual and image features in multimodal sentiment analysis, we propose a multimodal interaction network (MAIN) based on the attention mechanism. Specifically, MAIN first adopts an encoder structure to remove impurities and reduce noise in the input visual and image features. Second, deeper feature information is learned through attention mechanisms and bidirectional GRU networks. Finally, multimodal features of the same dimension are output through a pooling layer, and multimodal feature fusion is then carried out. The model was tested on the CMU-MOSEI and CMU-MOSI datasets. Compared with several baseline models, MAIN improves accuracy and F1 score to some extent. (3) To address the loss of important dimensional information from the original input data, we propose adding a data cleaning step to the feature pre-training process of each modality separately. In sentiment analysis
tasks, meaningless factors such as impurities and noise in the data can greatly affect the experimental results. Our method not only effectively removes emotion-irrelevant noise but also simplifies the overall architecture of the model. Validation on multiple datasets shows that feature extraction and pre-training on the raw data can improve model performance and accuracy to some extent. |
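
Contribution (1) above classifies emotions with a center loss combined with a softmax function. As an illustration only, the following minimal sketch shows one common way such a joint objective can be computed: softmax cross-entropy plus a weighted center loss term, where the center loss is half the squared distance between a sample's feature vector and the center of its class. The function names, the weight `lam`, the batch averaging, and the toy values are assumptions made for this sketch and are not taken from the paper; the actual model may weight or compute these terms differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample."""
    return -math.log(softmax(logits)[label])


def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(features, logits, labels, centers, lam=0.1):
    """Batch mean of cross-entropy + lam * center loss.

    `lam` is a hypothetical trade-off weight chosen for illustration.
    """
    total = 0.0
    for x, z, y in zip(features, logits, labels):
        total += cross_entropy(z, y) + lam * center_loss(x, centers[y])
    return total / len(labels)


if __name__ == "__main__":
    # Toy batch: two samples, two emotion classes, 2-D features.
    features = [[1.0, 0.0], [0.0, 1.0]]
    logits = [[2.0, 0.0], [0.0, 2.0]]
    labels = [0, 1]
    centers = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    print(joint_loss(features, logits, labels, centers))
```

In this toy batch each feature sits exactly on its class center, so the center loss term is zero and the result equals the cross-entropy alone. Moving a feature away from its center increases the loss, which is the intended effect of the term: features of the same emotion class are pulled together.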