
Action Recognition And Temporal Action Localization Based On Attention Mechanism

Posted on: 2023-05-22  Degree: Master  Type: Thesis
Country: China  Candidate: F Y Zhai  Full Text: PDF
GTID: 2568306848467194  Subject: Engineering
Abstract/Summary:
Video action understanding is an important research topic in computer vision and is already widely applied in intelligent surveillance, human-computer interaction, VR (Virtual Reality), and other fields. However, factors such as illumination, target scale, and action duration introduce complex spatio-temporal dependencies and background noise into video understanding, which makes it difficult to improve the discriminative power of a model. To address this problem, building on recently popular video understanding algorithms, this thesis proposes the following three improvements, covering spatio-temporal feature extraction, temporal modeling, computational efficiency, and background suppression:

First, to address the limitations of Vision Transformer-based action recognition models in extracting local spatio-temporal features and modeling long-term temporal dependencies, as well as their large memory footprint during training, a temporal-convolution feature-stream shift-cache Vision Transformer model is proposed. Temporal convolution is first introduced into the Vision Transformer for short-term local temporal modeling to capture action details. Then, along the time dimension, the spatio-temporal features of some frames are transferred into a cache queue to model long-term dependencies. Finally, during training, back-propagation is performed one video segment at a time to reduce memory consumption.

Second, to address the lack of local features when current Vision Transformer action recognition models extract spatial features, their poor adaptation to the multi-scale variation of human actions, and their high computational cost and low runtime efficiency, an efficient temporal token-shift Vision Transformer action recognition model is proposed. Depthwise separable convolution is first introduced into the Vision Transformer to enlarge the spatial local receptive field, and
then the image tokens are shifted along the time dimension to promote information exchange between adjacent frames, so that the model can learn 3D spatio-temporal feature representations from the 2D image space without adding any parameters or computation.

Third, to address the difficulty existing temporal action localization algorithms have in effectively distinguishing action frames from background frames, a weakly supervised temporal action localization algorithm guided by a graph-convolution attention mechanism is proposed. The similarity between video frames is first modeled by graph convolution, and the attention weights of video clips are learned from the graph-convolution feature space. The clip features are then weighted by these attention values to strengthen the response of action frames and suppress the response of background frames. Meanwhile, a video metric loss encourages similar videos to have similar feature representations. Finally, to make full use of video context information, a Bidirectional Gated Recurrent Unit (BiGRU) module is constructed, making action localization more accurate.
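Two of the core operations described above can be sketched as plain array code: the parameter-free temporal token shift (second contribution) and the graph-convolution attention weighting of snippet features (third contribution). This is a minimal illustration, not the thesis implementation: the shapes, the `shift_ratio` value, and the scoring vector `w` (which stands in for a learned attention layer) are all assumptions.

```python
import numpy as np

def temporal_token_shift(x, shift_ratio=0.25):
    """Shift a fraction of token channels along the time axis (zero-padded).

    x: (T, N, C) array -- T frames, N image tokens per frame, C channels.
    The first `fold` channels move one frame forward in time, the next
    `fold` move one frame backward, and the rest stay put, so adjacent
    frames exchange information at zero extra parameters or FLOPs.
    """
    T, N, C = x.shape
    fold = int(C * shift_ratio)
    out = np.zeros_like(x)
    out[1:, :, :fold] = x[:-1, :, :fold]                  # from previous frame
    out[:-1, :, fold:2 * fold] = x[1:, :, fold:2 * fold]  # from next frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # unshifted channels
    return out

def graph_attention_weighting(feats, w):
    """Weight snippet features by a graph-convolution attention score.

    feats: (T, D) per-snippet features; w: (D,) hypothetical scoring vector
    standing in for the learned attention layer. Builds a cosine-similarity
    graph over snippets, propagates features with one row-normalized
    graph-convolution step, then maps each propagated feature to a sigmoid
    foreground score used to re-weight the original features.
    """
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    adj = np.maximum(normed @ normed.T, 0.0)             # positive affinities only
    adj = adj / (adj.sum(axis=1, keepdims=True) + 1e-8)  # row-normalize
    prop = adj @ feats                                   # one propagation step
    attn = 1.0 / (1.0 + np.exp(-(prop @ w)))             # scores in (0, 1)
    return attn, attn[:, None] * feats                   # suppress background
```

In the thesis both operations sit inside networks trained end to end (the shift inside Vision Transformer blocks, the attention jointly with a metric loss and a BiGRU); the sketch only shows the data flow of each step.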
Keywords/Search Tags: action recognition, temporal action localization, Vision Transformer, graph convolution attention