
Research On Human Action Recognition Method Based On Deep Learning

Posted on: 2024-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: P Y Jiao
Full Text: PDF
GTID: 2568306941494724
Subject: Engineering
Abstract:
As an important medium of communication, video carries rich information, but it once had to be interpreted and classified manually, a time-consuming and laborious process. With the development of deep learning, many video-understanding tasks have emerged, among which human action recognition is one of the most important. To overcome the problems of high computational cost, difficulty in modeling long-term dependencies, and difficulty in feature fusion, this thesis jointly exploits the static spatial features within video frames and the temporal action features between frames, and studies human action recognition methods in depth. The main research content is as follows:

(1) The residual-network-based ER3D model is proposed to address the high computational cost of 3D convolution on video data and its tendency to lose information. The model is mainly used to extract the spatial information contained in videos: depthwise separable convolutions reduce the number of model parameters, and an inverted bottleneck structure reduces the loss of features during channel-dimension changes. At the same time, the number of base channels and the convolution kernel size are increased to strengthen feature extraction, while the number of activation functions and normalization layers is reduced to keep the network stable. On the UCF101 human action dataset, the ER3D model reaches a recognition accuracy of 90.6%, outperforming comparable mainstream recognition models and showing that it models the spatial features in videos more efficiently. A hedged code sketch of such a block appears below.

(2) The attention-based Temporal Vision Transformer model is proposed to address the weak long-video processing ability of two-stream networks. The model is mainly used to extract the temporal information contained in videos. Because the model cannot process entire frames directly, two frame-segmentation methods are proposed, sequential segmentation and compressed segmentation; the compressed method better preserves the temporal-dimension information carried by multiple consecutive video frames. The segmented frames serve as the model input, and a novel spatiotemporal attention mechanism in the model structure extracts the temporal action features between frames. On the UCF101 dataset, the Temporal Vision Transformer reaches a recognition accuracy of 92.4%, outperforming most comparable mainstream recognition models and showing strong competitiveness in extracting temporal features from video; a sketch of one plausible reading follows the ER3D example below.
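As a hedged illustration of the ideas in (1), the following PyTorch sketch shows what an inverted-bottleneck 3D residual block built around a depthwise separable convolution might look like. The class name InvertedBottleneck3D and all hyperparameters (expansion factor, kernel size, channel count) are illustrative assumptions, not the thesis's actual ER3D implementation.

```python
import torch
import torch.nn as nn

class InvertedBottleneck3D(nn.Module):
    """Illustrative 3D inverted-bottleneck residual block in the spirit of
    ER3D: a depthwise 3D convolution with a large kernel, channel expansion
    via pointwise convolutions, and a single normalization/activation pair."""

    def __init__(self, channels: int, expansion: int = 4, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise 3D convolution: one filter per channel keeps the
        # parameter count low while a large kernel widens the receptive field.
        self.dwconv = nn.Conv3d(channels, channels, kernel_size,
                                padding=pad, groups=channels)
        self.norm = nn.BatchNorm3d(channels)           # single norm layer
        # Inverted bottleneck: expand channels, apply one activation, project back.
        self.pw1 = nn.Conv3d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()                           # single activation
        self.pw2 = nn.Conv3d(channels * expansion, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pw1(x)
        x = self.act(x)
        x = self.pw2(x)
        return x + residual                            # residual connection

# Example: a batch of two 8-frame clips with 64 feature channels at 56x56.
clip = torch.randn(2, 64, 8, 56, 56)                   # (N, C, T, H, W)
print(InvertedBottleneck3D(64)(clip).shape)            # torch.Size([2, 64, 8, 56, 56])
```

Grouped (depthwise) convolution cuts the spatial-convolution parameters from C*C*k^3 to C*k^3, which is the usual mechanism behind the parameter savings the abstract describes.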
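The abstract does not detail how compressed segmentation works, so the following sketch shows one plausible reading: a tubelet-style embedding in which each token is cut from a small space-time tube spanning several consecutive frames, followed by a minimal transformer block with joint attention over all space-time tokens. TubeletEmbedding, SpatioTemporalBlock, and every hyperparameter are hypothetical stand-ins, not the thesis's Temporal Vision Transformer.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """One plausible reading of 'compressed segmentation': each token covers
    a space-time tube of several consecutive frames, so temporal information
    survives the patching step (hypothetical sketch)."""

    def __init__(self, in_ch=3, embed_dim=256, patch=16, frames_per_tube=2):
        super().__init__()
        # A strided 3D convolution extracts non-overlapping space-time tubes.
        self.proj = nn.Conv3d(in_ch, embed_dim,
                              kernel_size=(frames_per_tube, patch, patch),
                              stride=(frames_per_tube, patch, patch))

    def forward(self, x):                     # x: (N, C, T, H, W)
        x = self.proj(x)                      # (N, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (N, tokens, D)

class SpatioTemporalBlock(nn.Module):
    """Minimal transformer block: joint attention over all space-time tokens."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):
        y = self.norm1(x)
        a, _ = self.attn(y, y, y)             # attention across space and time
        x = x + a
        return x + self.mlp(self.norm2(x))

frames = torch.randn(2, 3, 8, 224, 224)       # two 8-frame RGB clips
tokens = TubeletEmbedding()(frames)           # (2, 4*14*14, 256) tokens
print(SpatioTemporalBlock()(tokens).shape)    # torch.Size([2, 784, 256])
```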
(3) To address the insufficient expressive power of the video features extracted by any single model, a fusion model with spatial and temporal dual channels is proposed. The two channels use the ER3D spatial feature extraction network and the Temporal Vision Transformer temporal feature extraction network, respectively. Because the two networks ultimately extract feature maps of the same size for classification, the two feature maps are fused before passing through the fully connected layer. On the UCF101 human action dataset, the fusion model reaches a recognition accuracy of 93.6%, outperforming comparable mainstream recognition models. This shows that the model can jointly exploit the extracted spatial and temporal video features, combine the strengths of both, and further improve the accuracy of human action recognition.
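A minimal sketch of the late fusion described in (3), assuming both branches output feature vectors of the same size. The class name DualStreamFusion, the feature dimension, and the fusion operator (concatenation here) are assumptions; the abstract only states that the two feature maps are fused before the fully connected layer, so element-wise averaging would be an equally plausible choice.

```python
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Hedged sketch of dual-channel late fusion: the spatial and temporal
    branches each yield same-sized feature vectors, which are fused before a
    shared fully connected classifier. The branches are stand-in modules."""

    def __init__(self, spatial_net: nn.Module, temporal_net: nn.Module,
                 feat_dim: int = 512, num_classes: int = 101):
        super().__init__()
        self.spatial_net = spatial_net        # e.g. an ER3D-style 3D CNN
        self.temporal_net = temporal_net      # e.g. a Temporal ViT encoder
        # Concatenation doubles the feature size seen by the classifier.
        self.classifier = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        f_spatial = self.spatial_net(clip)                 # (N, feat_dim)
        f_temporal = self.temporal_net(clip)               # (N, feat_dim)
        fused = torch.cat([f_spatial, f_temporal], dim=1)  # (N, 2*feat_dim)
        return self.classifier(fused)                      # (N, num_classes)

# Example with dummy branches that map a clip to 512-d feature vectors;
# num_classes defaults to 101, matching UCF101.
spatial = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
temporal = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
model = DualStreamFusion(spatial, temporal)
print(model(torch.randn(2, 3, 8, 32, 32)).shape)           # torch.Size([2, 101])
```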
Keywords: Deep learning, Human action recognition, Convolutional neural network, Attention mechanism, Feature fusion