
Human Action Recognition In Videos With Deep Learning

Posted on: 2023-11-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H B Wu    Full Text: PDF
GTID: 1528306617958769    Subject: Navigation, guidance and control
Abstract/Summary:
Video-based human action recognition is one of the research hotspots in computer vision, with far-reaching theoretical significance and broad practical application prospects. Owing to the rich inter-class variations and pronounced intra-class differences caused by the diversity of human actions, and to the inefficient extraction of spatio-temporal features caused by interference factors such as cluttered backgrounds and viewpoint and illumination changes in real scenes, video-based human action recognition remains a very challenging research topic. Recently, thanks to the continuous improvement of computing power and the explosive growth of video data on the internet, data-driven deep learning has developed rapidly and become the mainstream approach to human action recognition. A great deal of research on deep-learning-based action recognition has been carried out and remarkable progress has been made, but the following shortcomings remain: (1) action recognition methods based on deep convolutional neural networks (CNNs) tend to predict actions from the appearance of scenes and objects, which makes them easily misled by cluttered backgrounds; they also cannot automatically focus on the informative motion regions of human actions, which limits recognition performance; (2) video action recognition depends heavily on effective spatio-temporal feature learning, yet existing 2D CNNs, while adept at extracting rich spatial information from videos, lack the ability to model temporal structure directly; (3) 3D CNNs can learn spatial and temporal features jointly, but they contain a large number of parameters, which increases model complexity; moreover, almost all existing 3D CNN-based methods recognize human actions using only RGB videos, which limits recognition performance. To address these problems, this dissertation conducts in-depth algorithmic research. The main contributions are summarized as follows:

(1) Hierarchical dynamic depth projected difference images for action recognition in videos with convolutional neural networks. To address the problem that 2D CNN-based methods extract the spatial and temporal features of video actions separately and represent spatio-temporal information inefficiently, this dissertation focuses on depth-video action recognition and proposes an efficient video representation named hierarchical dynamic depth projected difference images (HDDPDI). First, the depth video sequence is projected onto three orthogonal Cartesian views; rank pooling is then applied to hierarchically encode the spatio-temporal motion changes in each projection view at different temporal scales. The resulting HDDPDI representation captures the spatio-temporal information of human actions simultaneously from different perspectives and different temporal scales, and can effectively describe the 3D motion patterns of depth-video actions. The multi-view HDDPDI is fed into 2D CNNs for spatio-temporal feature learning, and three multi-view information fusion schemes, operating at different network layers, are designed for action recognition. Experimental results on three public human action datasets show that the HDDPDI representation contains rich spatio-temporal motion information, enabling CNNs to learn more comprehensive action features, and that multi-view fusion significantly improves the performance of action recognition in depth videos.

(2) Convolutional networks with a channel and STIPs attention model for action recognition in videos. To address the problems that CNNs lack the ability to model the long-term temporal dependency of an entire video and are insensitive to the informative motion regions of human actions, this dissertation proposes a channel and spatio-temporal interest points (STIPs) attention CNN, together with dynamic image sequences for representing video actions, which effectively describe the long-term spatio-temporal dynamics of the whole video by modeling its local short-term spatio-temporal structure. The channel and STIPs attention model has two parts: channel attention and STIPs attention. Channel attention assigns different weights to different channels, strengthening the discriminative channels in the network by automatically learning from the multi-channel convolutional features. STIPs attention projects the spatio-temporal interest points detected in a dynamic image into the feature-map space to generate spatial attention weights that focus on the salient motion regions. The model can be flexibly embedded into CNNs to enhance their feature representation ability. On top of the enhanced convolutional features, a long short-term memory (LSTM) network models the temporal dependency and outputs the final action prediction. Experimental results show that the proposed method makes full use of the multi-channel and spatial characteristics of convolutional features and extracts discriminative spatio-temporal information, significantly improving the performance of video-based human action recognition.

(3) Spatio-temporal multimodal learning with 3D CNNs for video action recognition. Since almost all existing 3D CNN-based methods recognize human actions from a single RGB data modality, which limits the performance of 3D networks, this dissertation proposes a novel multimodal two-stream 3D network framework that explores the spatio-temporal feature learning ability of 3D CNNs on depth and pose data, and combines the complementary information of the different modalities to improve recognition performance. The proposed method constructs the depth residual dynamic image sequence (DRDIS) and the pose estimation map sequence (PEMS) as multimodal video action representations. DRDIS models the salient spatio-temporal motion patterns of human actions with a set of dynamic frames; PEMS intuitively describes the spatio-temporal evolution of body posture with a set of color-coded pose images. Experimental results on four human action datasets show that 3D CNNs can effectively learn the spatio-temporal information in depth and pose data, and that multimodal fusion helps to enhance video action recognition performance.

(4) Multi-level channel attention guided spatio-temporal motion learning for human action recognition. To address the problem that most existing action recognition methods learn spatio-temporal cues from convolutional feature maps without jointly considering channel discrimination, this dissertation proposes a multi-level channel attention guided spatio-temporal motion learning (MCA-STML) module that effectively captures the spatio-temporal evolution of human actions under the guidance of channel attention. The module consists of two stages: multi-level channel attention excitation (MCAE) and spatio-temporal motion modeling (STMM). MCAE generates motion-aware channel relations from video convolutional features at both the frame and video levels. STMM captures spatial movements bidirectionally along the time dimension in the partial motion-salient feature channels selected under the guidance of MCAE. The MCA-STML module models spatio-temporal structure effectively and flexibly, and can be embedded into many popular 2D networks at very limited additional computational cost to enhance their spatio-temporal modeling ability. Experimental results show that the proposed method effectively enhances the spatio-temporal motion learning ability of 2D networks and obtains competitive action recognition results.
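To illustrate the rank-pooling step underlying HDDPDI and dynamic images, the following sketch uses the approximate rank pooling of Bilen et al. in a commonly used simplified form, weighting frame t (1-indexed, T frames in total) by 2t - T - 1 and summing. This linear weighting is an illustrative simplification, not necessarily the dissertation's exact hierarchical formulation.

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a (T, H, W) stack of (projected) depth frames into one image.

    Simplified approximate rank pooling: frame t gets the linear weight
    alpha_t = 2t - T - 1, so early frames are weighted negatively and late
    frames positively; the weights sum to zero, so a static sequence
    collapses to an all-zero image and only temporal change survives.
    """
    T = frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alphas = 2.0 * t - T - 1.0
    # Contract the time axis: sum_t alpha_t * frames[t].
    return np.tensordot(alphas, frames.astype(np.float64), axes=1)
```

Applying this at several temporal scales (whole sequence, halves, quarters) and per projection view would yield a hierarchical, multi-view set of dynamic images in the spirit of HDDPDI.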
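The channel-attention part of contribution (2) follows the general squeeze-and-excitation pattern: globally pool each channel, pass the pooled vector through a small bottleneck, and re-weight the channels with the resulting sigmoid gates. The sketch below shows that pattern in NumPy; the weight matrices `w1` and `w2` stand in for learned parameters, and the exact attention design in the dissertation may differ.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Re-weight channels of a (C, H, W) feature map.

    w1: (C//r, C) bottleneck projection; w2: (C, C//r) expansion.
    Both would be learned in a real network; here they are inputs.
    """
    # Squeeze: global average pooling per channel -> (C,)
    squeezed = features.mean(axis=(1, 2))
    # Excitation: ReLU bottleneck, then sigmoid gates in (0, 1).
    hidden = np.maximum(0.0, w1 @ squeezed)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Scale each channel by its gate.
    return features * gates[:, None, None], gates
```

Spatial (STIPs-guided) attention would analogously produce an (H, W) weight map multiplied into every channel.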
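The multimodal frameworks in contributions (1) and (3) both end in a fusion step. One standard late-fusion scheme, shown below as a hedged sketch (the dissertation's own fusion schemes operate at several network layers and may differ), averages the per-modality class probabilities, optionally with modality weights.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def late_fusion(logits_list, weights=None):
    """Fuse per-modality logits (e.g. RGB, depth, pose) by weighted
    averaging of class probabilities. Returns a distribution over classes."""
    probs = np.stack([softmax(l) for l in logits_list])
    if weights is None:
        weights = np.full(len(logits_list), 1.0 / len(logits_list))
    # np.average normalizes by the weight sum, so the result sums to 1.
    return np.average(probs, axis=0, weights=weights)
```

The fused distribution's argmax gives the predicted action class; modality weights could be tuned on a validation set.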
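Contribution (4)'s STMM stage models motion bidirectionally in a subset of channels. A minimal sketch of that idea is to form forward and backward temporal differences on the first fraction of channels and add them back as a residual; the 25% split ratio and the simple differencing are illustrative assumptions, not the dissertation's exact design.

```python
import numpy as np

def bidirectional_motion(features, motion_ratio=0.25):
    """Inject bidirectional temporal differences into a (T, C, H, W) tensor.

    Only the first motion_ratio fraction of channels is treated as
    motion-salient (in MCA-STML this selection would be guided by MCAE).
    """
    T, C, H, W = features.shape
    m = int(C * motion_ratio)
    out = features.copy()
    # Forward differences: x[t+1] - x[t] (zero at the last frame).
    fwd = np.zeros_like(features[:, :m])
    fwd[:-1] = features[1:, :m] - features[:-1, :m]
    # Backward differences: x[t-1] - x[t] (zero at the first frame).
    bwd = np.zeros_like(features[:, :m])
    bwd[1:] = features[:-1, :m] - features[1:, :m]
    out[:, :m] = features[:, :m] + 0.5 * (fwd + bwd)
    return out
```

Because both differences vanish on a temporally static input, the module leaves appearance-only features untouched, which keeps the added computation focused on motion.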
Keywords/Search Tags:Human action recognition, Deep learning, Convolutional neural network, Attention mechanism, Spatiotemporal feature learning