With the rise of social media, more and more users share short videos on social platforms to express emotions or opinions. Analyzing the content of online videos and the emotions they convey supports the monitoring of public opinion on trending events and facilitates the classification and management of online videos, so online video sentiment analysis has gradually become a research hotspot. Although deep-learning-based video emotion recognition has achieved promising results, many challenges remain in online video sentiment analysis. For example, emotional expression in videos is sparse, so extracting features from entire video frames for affective computing increases the computational complexity of the model. In addition, emotion is usually triggered by multiple objects performing specific events in a specific scene, yet existing methods ignore the emotional relationships between different objects, leaving visual information underutilized. To address these problems, this thesis studies the following:

1. To address the sparsity of emotional expression in videos, this thesis proposes KFR-AVER, an audio-visual emotion recognition framework based on key frame and key region extraction. Unlike existing methods that directly extract features from whole video frames, the framework uses a Graph Attention Network (GAT) to reason about the emotional relationships between different regions of a video frame and locate its key regions. An adaptive Bidirectional Long Short-Term Memory network (Bi-LSTM) then summarizes the video scenes and identifies key-frame features, thereby alleviating the sparsity of emotional expression. Contextual information from the acoustic features is used to supplement the visual information, and an adaptive gated neural network fuses the audio and video features. Experimental results show that this method effectively improves the model's emotion classification performance.

2. To handle the complex spatio-temporal relationships between objects in short online videos, this thesis proposes a spatio-temporal scene graph emotion reasoning network (STSGR). The model extracts local object features by reasoning over a constructed spatio-temporal scene graph, capturing the temporal, spatial, and spatio-temporal relationships between objects. A knowledge extraction mechanism is then adopted in the scene branch, where object interaction information is used to train the scene branch to provide the global context missing from local object features. In addition, because audio in natural environments has a complex temporal structure, a Channel Temporal Attention Mechanism (CTAM) is used to enhance the acoustic feature representation. Finally, the fused audio and video features are used to predict video emotion. Experimental results show that the proposed method effectively improves the model's sentiment classification accuracy.
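As an illustration of the acoustic enhancement step in the second contribution, the sketch below shows a generic channel-temporal attention block over acoustic features. The thesis's exact CTAM design is not specified in this abstract; the PyTorch implementation, the squeeze-and-excitation-style channel gate, the temporal softmax gate, and all dimensions are assumptions for illustration only.

```python
# Illustrative sketch only: a channel-temporal attention block for acoustic
# features shaped (batch, channels, time); not the thesis's exact CTAM.
import torch
import torch.nn as nn

class ChannelTemporalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel branch: global average pool over time, then a bottleneck MLP.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Temporal branch: 1x1 convolution scoring each time step.
        self.temporal_conv = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        channel_weights = self.channel_mlp(x.mean(dim=-1))            # (B, C)
        x = x * channel_weights.unsqueeze(-1)                          # channel re-weighting
        temporal_weights = torch.softmax(self.temporal_conv(x), -1)   # (B, 1, T)
        return x * temporal_weights                                    # temporal re-weighting

# Example: enhance a batch of 8 acoustic feature maps (128 channels, 200 frames).
features = torch.randn(8, 128, 200)
enhanced = ChannelTemporalAttention(128)(features)
```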
3. To address the problem that heterogeneous data is difficult to fuse, and since joint representation learning reduces the differences between modalities, this thesis proposes an audio-visual emotion recognition method based on shared-private feature fusion (AVSPFF). The main idea is to first map the audio and video features extracted by the KFR-AVER and STSGR models into a modality-common subspace and modality-specific subspaces, and then constrain these subspaces through loss functions to ensure that the information extracted from different subspaces is complementary. A Multi-Head Attention Mechanism integrates the information from the different subspaces into a single feature vector, which is finally used for video emotion classification. Comparative experiments with variant and baseline models on public and self-built datasets demonstrate the effectiveness of the proposed model.
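The sketch below illustrates the shared-private idea in the third contribution: projecting audio and video features into common and specific subspaces, constraining them, and fusing them with multi-head attention. The PyTorch code, dimensions, class count, and the particular loss terms (an MSE similarity loss between shared projections and a cosine-based difference loss between shared and private projections) are assumptions; the thesis's actual loss functions and fusion details are not given in this abstract.

```python
# Illustrative sketch only: shared-private projection and multi-head attention
# fusion in the spirit of the AVSPFF description above; losses are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateFusion(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 8, heads: int = 4):
        super().__init__()
        self.shared = nn.Linear(dim, dim)      # modality-common subspace (shared weights)
        self.private_a = nn.Linear(dim, dim)   # audio-specific subspace
        self.private_v = nn.Linear(dim, dim)   # video-specific subspace
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, dim) features from the upstream encoders
        sa, sv = self.shared(audio), self.shared(video)         # common subspace
        pa, pv = self.private_a(audio), self.private_v(video)   # specific subspaces

        # Assumed constraints: pull shared projections together, push
        # shared and private projections apart (complementarity).
        sim_loss = F.mse_loss(sa, sv)
        diff_loss = F.cosine_similarity(sa, pa, dim=-1).abs().mean() + \
                    F.cosine_similarity(sv, pv, dim=-1).abs().mean()

        # Multi-head attention integrates the four subspace vectors.
        tokens = torch.stack([sa, sv, pa, pv], dim=1)            # (batch, 4, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        logits = self.classifier(fused.mean(dim=1))
        return logits, sim_loss + diff_loss

# Example: fuse 256-d audio and video embeddings for a batch of 16 clips.
logits, aux_loss = SharedPrivateFusion()(torch.randn(16, 256), torch.randn(16, 256))
```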