With the progress of communication technology and the popularity of social platforms such as Facebook and WeChat, large amounts of multimodal data expressing emotion are generated in people's daily social lives. Sentiment analysis has spread from computer science to business marketing, finance, political science, health science, and even social and natural sciences such as history. Text is the most commonly used information for sentiment analysis: it expresses emotion through words, phrases, and syntactic relationships. In some cases, however, it is difficult to judge emotion accurately from text alone. The interaction between the text, audio, and video modalities provides richer information and yields more emotional characteristics; because fused information carries more emotional cues, it often improves the accuracy of the overall result or decision. The data of the individual modalities are usually misaligned, because the sequences from different modalities are sampled at different rates, so long-range dependencies across modalities have to be inferred. Deep learning is the core of the current development of artificial intelligence and a breakthrough over traditional machine learning; even with sparse data it often yields satisfactory results. Pre-trained models have brought natural language processing into a new phase of development: good representations are learned from large-scale unlabeled corpora and then reused for other tasks, which alleviates the overfitting caused by small data sets.

This thesis first introduces a method of multimodal feature extraction based on deep learning: unified pre-trained models are used whose parameters can be fine-tuned for downstream tasks, and a new video-language modeling approach is introduced. The thesis then proposes a new text sentiment analysis model that combines an attention module with a graph neural network, text being the most common carrier in sentiment analysis. Finally, building on the text modality, the focus of this research, multimodal sentiment analysis based on deep learning, is presented, and two different cross-modal attention structures are designed. Compared with the text-only model, the accuracy of the proposed method is improved, and comparison with other multimodal models objectively demonstrates its effectiveness. The main work includes:

(1) Features of audio, video, and text are extracted with deep learning, using pre-trained models uniformly. For the text modality, the more robust RoBERTa model is used, which is trained on more diverse data with a dynamic masking strategy. For audio features, wav2vec 2.0 is used: a masking strategy lets the model learn to recover the masked parts, achieving the effect of pre-training. For the video modality, to address the disconnection between upstream and downstream task domains, the lack of connection with the text language, and low computational performance, a new video-language representation learning method with a sparse sampling strategy is introduced; a Video Swin Transformer is added to preserve the spatio-temporal information of the video frames, and a dVAE is used to make the model learn to recover masked video regions. The pre-trained model also supports feature extraction from static images. In this thesis, a new feature extractor is designed that takes the last layer of the pre-trained model as input, obtains a sentence representation through interleaved convolution and pooling, and adds a ReLU activation function to increase the non-linear representation capacity.
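As a rough illustration of the feature extractor described in (1), the following is a minimal PyTorch-style sketch; the layer sizes, kernel size, and the exact arrangement of convolution and pooling are assumptions made for illustration, not the thesis's actual configuration.

import torch
import torch.nn as nn

class ConvPoolExtractor(nn.Module):
    """Maps the last hidden states of a pre-trained encoder
    (batch, seq_len, hidden) to a fixed-size sentence representation."""
    def __init__(self, hidden_dim=768, out_dim=128, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden_dim, out_dim, kernel_size, padding=1)
        self.act = nn.ReLU()                   # non-linear representation
        self.pool = nn.AdaptiveMaxPool1d(1)    # pool over the sequence axis

    def forward(self, last_hidden_state):
        x = last_hidden_state.transpose(1, 2)  # (batch, hidden, seq_len)
        x = self.act(self.conv(x))
        x = self.pool(x).squeeze(-1)           # (batch, out_dim)
        return x

# Example: hidden states shaped like a RoBERTa output
sentence_vec = ConvPoolExtractor()(torch.randn(4, 50, 768))  # -> (4, 128)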
(2) A novel text sentiment analysis model combined with deep learning is proposed. The model consists of a pre-trained model, a graph convolutional network, and a fusion attention module. The weights of the pre-trained features are dynamically adjusted by the graph convolutional network, and the accuracy of text sentiment analysis on the CMU-MOSI data set reaches 82.9%.

(3) A new multimodal sentiment analysis method is proposed on top of the deep-learning-based multimodal feature extraction with pre-trained models. The model consists of cross-modal attention blocks, self-modal attention blocks, layer normalization, a graph convolutional network, and residual structures, and two different cross-modal attention structures are designed. Experiments show that, even when the data are unaligned, the model effectively attends across modalities and captures long-range dependencies. The analysis shows that the cross-modal model obtains richer emotional information than the single-modal model and performs best among the compared multimodal models, reaching an accuracy of 83.8% on CMU-MOSI, which objectively demonstrates the feasibility of the method.
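To make the cross-modal attention block of (3) concrete, the following is a minimal PyTorch-style sketch assuming a pre-norm layout with a residual connection; the dimensions, number of heads, and this particular layout are illustrative assumptions rather than the exact design of the thesis.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One target modality (e.g. text) attends to one source modality
    (e.g. audio or video); the two sequences may have different lengths,
    so no alignment between the streams is required."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target, source):
        q = self.norm_q(target)         # layer normalization
        kv = self.norm_kv(source)
        out, _ = self.attn(q, kv, kv)   # cross-modal attention
        return target + out             # residual connection

# Example: 50 text steps attend to 120 unaligned audio steps
text, audio = torch.randn(4, 50, 128), torch.randn(4, 120, 128)
fused = CrossModalBlock()(text, audio)  # -> (4, 50, 128)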