| With the development of deep learning, video understanding tasks have become increasingly complex and diverse, and context modeling has become a focus of research on video content analysis. Current video context modeling methods mainly use variants of recurrent neural networks to analyze video content. However, these methods mine contextual information incompletely and at insufficient depth. This paper studies video context modeling through two tasks: video anomaly detection and video question answering. First, in video anomaly detection, abnormal events are unevenly distributed in surveillance videos, and the definition of an abnormal event depends heavily on context. To address the difficulty existing models have in identifying complex abnormal behavior in video sequences, this paper proposes a multi-level context modeling method based on graph convolution. In the feature aggregation stage of the graph convolution, the non-local similarity features of node pairs are combined with temporal local features to obtain multi-level context features. In the feature extraction stage of the graph network, a non-local attention module with instance normalization selects the necessary information, which addresses the missed detection of abnormal events in long videos. Experiments on two large benchmark datasets verify the effectiveness of the algorithm. Second, in video question answering, the video content moves frequently within a small spatio-temporal range, and existing models have difficulty expressing rich and complete video context features. This paper proposes a dual-branch, complementary multi-level context modeling method. The first branch uses a Transformer with relative position representations to connect different moments in a video sequence, learn multi-scale video features, and enhance the expression of non-local features. The second branch is the proposed de-redundancy module, which drives subsequent video clips to learn different information, enhances the expression of local features, and addresses the difficulty of expressing complex semantic features in short videos. Experiments on three benchmark datasets verify the effectiveness of the algorithm. |
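
The aggregation step described for the anomaly detection task, in which non-local similarity between node pairs is combined with temporal local features, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the softmax over cosine similarity, the exponential temporal decay with a hypothetical `sigma` parameter, the additive fusion of the two branches, and all function names. The instance-normalized attention module is not modeled.

```python
import numpy as np

def row_normalize(a):
    """Scale each row to sum to 1 so aggregation is a weighted average."""
    return a / a.sum(axis=1, keepdims=True)

def similarity_adjacency(x):
    """Non-local branch (assumed form): softmax over pairwise cosine similarity."""
    unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return row_normalize(e)

def temporal_adjacency(t, sigma=1.0):
    """Temporal local branch (assumed form): weights decay with index distance."""
    idx = np.arange(t)
    dist = np.abs(idx[:, None] - idx[None, :])
    return row_normalize(np.exp(-dist / sigma))

def multi_level_aggregate(x, w_sim, w_temp, sigma=1.0):
    """One graph-convolution layer that sums both branches, then applies ReLU."""
    a_sim = similarity_adjacency(x)
    a_temp = temporal_adjacency(x.shape[0], sigma)
    h = a_sim @ x @ w_sim + a_temp @ x @ w_temp
    return np.maximum(h, 0.0)

# Toy input: 8 video snippets, each with a 16-dimensional feature vector.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16))
w_sim = rng.standard_normal((16, 4)) * 0.1
w_temp = rng.standard_normal((16, 4)) * 0.1
context = multi_level_aggregate(features, w_sim, w_temp)
```

In this sketch each snippet is a graph node. The similarity adjacency lets a node draw on any other snippet with similar features regardless of distance in time, while the temporal adjacency restricts attention mostly to neighboring snippets, so the summed output mixes both levels of context.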