
Research On Video Action Recognition Based On Deep Learning

Posted on: 2020-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Lin
Full Text: PDF
GTID: 2428330572973684
Subject: Computer Science and Technology

Abstract/Summary:
With the explosive growth of video data, efficient and intelligent analysis and processing of video data have become increasingly urgent. Automatic video action recognition is a key technology for video understanding, analysis, and processing. It has broad application prospects in security monitoring, intelligent medical treatment, human-computer interaction, and other fields, and it is also one of the research hotspots and difficulties in computer vision. The expansion of visual action from 2D (two-dimensional) image space to 3D (three-dimensional) space-time increases the complexity of both action representation and the subsequent recognition task.

In the traditional pattern recognition paradigm, methods based on hand-crafted features long dominated video action recognition research. However, such methods have several shortcomings: feature design and extraction are complex, the features are not robust, and they are vulnerable to changes in the optical environment. With the availability of large-scale video data and advances in high-performance parallel computing, deep learning has gradually become the mainstream direction in this field.

To achieve better performance, existing deep-learning-based video action recognition methods often adopt a "two-stream" architecture, which divides the network input into a spatial stream and a temporal stream. The spatial stream takes RGB images as input, while the temporal stream takes pre-extracted optical-flow feature maps as input. However, pre-extracting optical flow is very time-consuming, and the robustness of optical flow is easily affected by camera motion, which limits the practicality of the two-stream architecture. Processing the original video directly with a 3D convolutional network is another main research direction in this field. Although this approach greatly simplifies the video action recognition pipeline and improves processing speed, its recognition performance still lags well behind that of the two-stream architecture.

Based on an analysis of the advantages and disadvantages of existing deep-learning-based video action recognition models, this paper designs an end-to-end deep video action recognition model (DVARN), which combines a self-attention mechanism with a spatio-temporal feature extraction network, balancing model practicality and recognition performance. The main contents and innovations of this paper are:

(1) This paper designs an effective local spatio-temporal feature extraction network. Using the Inception V1 network, batch normalization, and the residual idea, a local spatio-temporal feature extraction network based on 3D convolution is designed. Transfer learning is used to improve the fitting and generalization ability of the feature extraction module: the 3D convolution kernels are initialized with parameters trained on the ImageNet static-image classification dataset, the model is then pre-trained on the Kinetics-400 dataset, and finally adjusted and optimized on the UCF-101 dataset. Experiments show that these methods effectively accelerate training and also improve the generalization ability and recognition performance of the model; a minimal sketch of the kernel-initialization step is given below.
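As a hedged illustration only: the following PyTorch sketch shows one common way to realize the 2D-to-3D initialization described above, tiling 2D ImageNet-pretrained kernels along the temporal axis. The function name inflate_conv2d_to_3d and the time_dim default are illustrative assumptions, not the thesis's actual code.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Build a 3D conv whose kernel is the 2D kernel tiled along time.

    The 2D weights are repeated across the temporal axis and divided by
    time_dim, so the inflated filter gives the same response as the
    original 2D filter on a static (repeated-frame) clip.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # (out, in, kH, kW) -> (out, in, T, kH, kW), averaged over time
    w2d = conv2d.weight.data
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(w3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate a 7x7 ImageNet-style stem into a 7x7x7 3D stem
stem2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
stem3d = inflate_conv2d_to_3d(stem2d, time_dim=7)
```

Dividing by time_dim preserves the activation scale of the pretrained 2D filters, which is what makes the ImageNet initialization useful for accelerating 3D training.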
(2) This paper studies and designs a global dependency modeling mechanism based on self-attention. The self-attention mechanism used in natural language processing is adapted to model the context dependency of video sequences. On top of the extracted local spatio-temporal features, an improved self-attention mechanism mines the correspondence relationships within feature sequences, so that feature sequences with only local perception ability can model long-term dependencies from a global perspective. This addresses the problem that existing video action recognition models rely solely on stacked network layers to expand the temporal receptive field and lack effective fusion constraints; a minimal sketch of this mechanism is given at the end of the abstract.

(3) This paper proposes the end-to-end deep video action recognition model DVARN. The model combines local spatio-temporal feature extraction and long-term dependency analysis in a single network, so it can directly process the original RGB image sequence without any additional pre-extraction of hand-crafted features. At the same time, the model fuses and makes the recognition decision directly at the level of intermediate sequence features, which further accelerates prediction. Experiments on benchmarks show that the proposed model achieves strong recognition performance and processing speed.
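As a hedged illustration of contribution (2): the sketch below implements plain scaled dot-product self-attention over a sequence of clip-level features. The class name SequenceSelfAttention and the single-head, residual design are assumptions; the "improved" variant the thesis refers to is not specified in this abstract.

```python
import math
import torch
import torch.nn as nn

class SequenceSelfAttention(nn.Module):
    """Scaled dot-product self-attention over a sequence of clip features.

    Input:  (batch, seq_len, dim) local spatio-temporal features.
    Output: same shape, where each position is a weighted mixture of all
    positions, giving locally-extracted features a global temporal view.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = math.sqrt(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        # attn[b, i, j]: how much clip j contributes to clip i
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return x + attn @ v  # residual keeps the local features intact

# Example: a batch of 2 videos, each as 8 clip features of dimension 1024
feats = torch.randn(2, 8, 1024)
out = SequenceSelfAttention(1024)(feats)  # -> (2, 8, 1024)
```

The residual connection reflects the text's point that self-attention augments, rather than replaces, the locally perceived spatio-temporal features.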
Keywords/Search Tags:deep learning, video action recognition, self-attention