
Research On Human Action Recognition Technology Based On 3D Convolution

Posted on: 2022-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: R P Wang
Full Text: PDF
GTID: 2518306335987349
Subject: Control Engineering
Abstract/Summary:
With the development of the economy and of hardware performance, video analysis tasks have received increasing attention. At the same time, human behavior recognition has been widely applied in virtual reality, video surveillance, video retrieval and other fields. Following the success of deep learning in image classification, two-stream networks and 3D convolutions, which extract spatial and temporal features simultaneously, have emerged. In recent years, many human behavior recognition models, such as C3D (3D ConvNet), I3D (Inflated 3D ConvNet) and S3D (Separable 3D ConvNet), have adopted three-dimensional convolution. However, the added dimension that makes effective spatio-temporal feature extraction possible also brings several problems: the number of parameters increases sharply and training depends heavily on the GPU; backgrounds are relatively complex, and the same type of behavior can look very different under changes of viewpoint; and the motion trajectories of different behaviors can be highly similar. To address these problems without loss of accuracy, this thesis makes the following improvements on the basis of existing work:

(1) S3D network optimization: Expanding the dimension of the convolution kernel sharply increases the number of parameters and the GPU computing load. To address this, researchers proposed the S3D convolutional neural network, which is based on Google's Inception network. This thesis optimizes that network by redesigning the number of layers, the network structure and the related parameters.

(2) OPT-S3D network: Building on the S3D network, the OPT-S3D network embeds an SENet (Squeeze-and-Excitation Networks) module, that is, an attention mechanism. The module computes differentiated weights for consecutive video frames, which helps the network distinguish key frames from uninformative ones. Feature extraction concentrates on the key frames, while relatively few computing resources are allocated to weakly correlated or irrelevant frames, improving training accuracy and efficiency. Experiments demonstrate the effectiveness and feasibility of this module.

(3) ATC (Adaptive Temporal Compression) module: This thesis designs an adaptive temporal compression (ATC) module that identifies redundant frames, deletes them, and feeds the remaining frames to the deep learning network as a compressed dataset. Experiments show that this method not only reduces computational and time complexity at the source, but also eases the need for high-performance GPU support when training video analysis models, making training simpler and more efficient.

The experiments use the UCF-101 and Kinetics datasets, and the results demonstrate the feasibility and effectiveness of the OPT-S3D network and the ATC module.
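The parameter growth mentioned in point (1) can be illustrated with simple arithmetic. The sketch below compares the weight count of a dense 3D convolution with that of a factorized one, in which a spatial 1 x k x k convolution is followed by a temporal k x 1 x 1 convolution. The channel sizes, the choice to keep the intermediate channel count equal to the output channel count, and the function names are assumptions made for illustration; the abstract does not give the thesis's actual layer configuration.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a dense 3D convolution (bias terms ignored)."""
    return c_in * c_out * kt * kh * kw


def separable_conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count when the kernel is split into a spatial 1 x kh x kw
    convolution followed by a temporal kt x 1 x 1 convolution.

    Assumption: the intermediate channel count equals c_out.
    """
    spatial = c_in * c_out * kh * kw   # 1 x kh x kw kernel
    temporal = c_out * c_out * kt      # kt x 1 x 1 kernel
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, 3, 3, 3)
separable = separable_conv3d_params(64, 64, 3, 3, 3)
print(full, separable, round(separable / full, 3))
```

For this hypothetical layer the factorized form needs 49152 weights instead of 110592, a little under half.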
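The frame weighting described in point (2) might look roughly like the following squeeze-and-excitation step applied along the time axis. This is a minimal pure-Python sketch, not the OPT-S3D implementation: applying the module per frame (rather than per channel, as in the original SENet), the function name `frame_attention`, and the example weight matrices are all assumptions; in a real network the weights would be learned.

```python
import math


def frame_attention(frames, w1, w2):
    """Weight T frames with a squeeze-and-excitation style gate.

    frames: list of T frames, each a flat list of feature values.
    w1: R x T matrix (bottleneck), w2: T x R matrix (restore to T).
    Returns (weights, scaled_frames).
    """
    # Squeeze: one scalar per frame via global average pooling.
    squeezed = [sum(f) / len(f) for f in frames]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Scale: multiply every value of a frame by that frame's weight.
    scaled = [[wt * v for v in f] for wt, f in zip(weights, frames)]
    return weights, scaled


# Hypothetical input: 4 frames with 3 feature values each, made-up weights.
frames = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [4.0, 4.0, 4.0], [1.0, 1.0, 1.0]]
w1 = [[0.5, -0.5, 0.25, 0.0], [0.1, 0.2, -0.3, 0.4]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
weights, scaled = frame_attention(frames, w1, w2)
print([round(w, 3) for w in weights])
```

Each weight lies strictly between 0 and 1, so frames the gate scores low are attenuated rather than removed.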
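The abstract does not state how the ATC module in point (3) judges a frame to be redundant. As one plausible illustration only, the sketch below drops a frame when its mean absolute pixel difference from the last kept frame falls below a threshold. The criterion, the fixed threshold and the name `compress_frames` are hypothetical; the thesis's adaptive rule may differ.

```python
def compress_frames(frames, threshold):
    """Return the indices of frames to keep.

    frames: list of frames, each a flat list of pixel intensities
            (all frames the same length).
    A frame is kept only if its mean absolute difference from the most
    recently kept frame exceeds `threshold` (hypothetical criterion).
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        ref = frames[kept[-1]]
        cur = frames[i]
        diff = sum(abs(a - b) for a, b in zip(cur, ref)) / len(cur)
        if diff > threshold:
            kept.append(i)
    return kept


# Toy clip: frames 1 and 3 are near-duplicates of frames 0 and 2.
video = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [10, 10, 10, 10],
    [10, 10, 10, 11],
    [30, 30, 30, 30],
]
kept = compress_frames(video, threshold=2.0)
print(kept)  # [0, 2, 4]
```

Only the kept frames would then be passed to the network, which is where the reduction in computation would come from.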
Keywords/Search Tags: Human action recognition, Deep learning, 3D convolution, Adaptive temporal compression technology