
Video-based Human Action Recognition And Prediction

Posted on: 2021-04-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: D Wang    Full Text: PDF
GTID: 1528307100974649    Subject: Computer Science and Technology
Abstract/Summary:
Through the visual system, humans have an enormous capacity for recognizing what action is happening now and for predicting what is about to happen in the near future. This ability is a critical ingredient in interacting with other people effectively and avoiding danger in time, for example when basketball players cooperate or when a driver brakes before a rear-end collision. For artificial intelligence systems, simulating the action recognition and prediction abilities of the human visual system can expand the range of applications of intelligent systems and greatly improve the efficiency of human-computer interaction, which is one of the ultimate goals of artificial intelligence research. Unlike the human visual system, artificial intelligence systems describe daily human actions with video data captured by vision sensors. Video-based human action recognition and prediction has therefore attracted extensive attention from researchers and become an important research topic in computer vision.

The key to video-based human action recognition and prediction is modeling the temporal dynamics of human actions. Owing to differing personal habits, varied action postures, and illumination changes, the same action can look significantly different across actors and scenes. To address this problem, researchers model temporal dynamics by learning accurate and robust feature representations of human actions from video data, which is the major research direction in this field. This thesis takes video-based human action recognition and prediction as its tasks, studies feature representation learning for human actions in video data, and obtains several new methods and results. Its main contributions are as follows:

1. For modeling motion information in complex dynamic scenes, this thesis explores the complementary relationship between motion direction and motion magnitude in action recognition, which alleviates the background motion noise problem in dynamic scenes; experiments on a real-world abnormal-action detection dataset of traffic scenes show significant performance. Specifically, an anomaly detection method for traffic scenes based on spatially aware motion reconstruction is proposed. To tackle motion noise in dynamic scenes, the method introduces complementary motion representations that describe the motion orientation and motion magnitude of an object separately. In addition, the spatial location of an object is incorporated into a sparse reconstruction framework to detect orientation and magnitude anomalies, and the two kinds of anomalies are adaptively weighted and fused by a Bayesian model.

2. For the cross-modal feature fusion problem, this thesis studies the complementary fusion of object appearance and motion features, alleviating the ambiguity problem in cross-scenario, multi-category human action recognition; the method achieves state-of-the-art performance on the large-scale action recognition datasets UCF101 and HMDB51. Specifically, a competitive fusion method for appearance and motion features is proposed. To exploit the complementary relationship between the two, the method designs a cross-modal message-passing mechanism for efficient fusion. Moreover, to alleviate inconsistent feature distributions and asymmetric information during fusion, a competing feature-fusion loss function is introduced to train the network end to end. Unlike the traditional two-stream network, which trains each stream independently and ignores the complementarity between appearance and motion during training, the proposed method exploits this complementarity explicitly.

3. For the problem of modeling long-term dynamic information, this thesis strengthens the temporal modeling ability of recurrent neural networks and overcomes the disruption caused by irrelevant noise in long videos, achieving a significant improvement on the long-term action recognition dataset HMDB51. Specifically, a memory-augmented temporal dynamic learning model is proposed. The model extends an existing recurrent network with an external memory module, which enables long-term dynamic information interaction and improves the model's ability to capture the long-term dynamics of complex human actions. Moreover, a discrete memory controller is introduced to write only the most salient information into the external memory module and ignore irrelevant information.

4. For the problem of modeling the long-term temporal structure of continuous human actions, this thesis mines the temporal causal association patterns of continuous actions, alleviating discontinuous and incorrect temporal structure in continuous action recognition results; significant improvements are achieved on the 50 Salads, GTEA, and Breakfast datasets. Specifically, a multi-stage refinement method for human action segmentation is proposed. The method introduces a gated forward refinement network, which adaptively finds errors in previous results and corrects them according to temporal context information. Moreover, to force the refinement network to focus on temporal structure errors in previous results, a multi-stage sequence-level refinement loss is introduced to guide refinement across stages by comparing the temporal structure of previous and refined results; it directly optimizes the segmental edit score via a policy gradient method.

5. For the early prediction of human actions, this thesis explores how temporal dynamic representations change over time, which alleviates the insufficient discriminability of representations of incomplete actions, and obtains state-of-the-art early action prediction performance on the UCF101, BIT, and UT datasets. Specifically, a feature-based generative adversarial model for action prediction is proposed. The model introduces a temporal residual generator network that enhances the features of partially observed videos by complementing their semantic information. Moreover, a competing discriminator and a perceptual classification network are combined to reduce the difference between the enhanced features of partially observed videos and the original features of complete videos, and to force the enhanced features to be discriminative enough for action prediction.
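The adaptive Bayesian fusion of orientation and magnitude anomalies in contribution 1 can be sketched as follows. The logistic likelihood mapping, the median centering, and the function name are illustrative assumptions, not the thesis's actual formulation:

```python
import numpy as np

def fuse_anomaly_scores(err_orient, err_magn, prior=0.5, scale=1.0):
    """Fuse orientation and magnitude anomaly evidence (sketch).

    Each sparse-reconstruction error is mapped to a likelihood of
    "anomalous" with a logistic function, and the two cues are combined
    by naive-Bayes fusion under an anomaly prior `prior`.
    """
    # Likelihood that each cue is anomalous (hypothetical logistic mapping).
    p_o = 1.0 / (1.0 + np.exp(-scale * (err_orient - np.median(err_orient))))
    p_m = 1.0 / (1.0 + np.exp(-scale * (err_magn - np.median(err_magn))))
    # Posterior probability of anomaly given both cues.
    num = prior * p_o * p_m
    den = num + (1.0 - prior) * (1.0 - p_o) * (1.0 - p_m)
    return num / den

# Third frame has large reconstruction errors in both cues.
scores = fuse_anomaly_scores(np.array([0.1, 0.2, 2.5]),
                             np.array([0.2, 0.1, 3.0]))
```

Frames whose orientation and magnitude errors are both large receive a posterior close to 1, while frames with ordinary errors stay near the prior.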
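Contribution 2's cross-modal message passing between the appearance and motion streams might look roughly like this sketch; the gating form and the fixed random projections standing in for learned weights are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def message_passing_fusion(app, mot, w_a2m, w_m2a):
    """One round of cross-modal message passing between an appearance
    feature `app` and a motion feature `mot` (illustrative sketch)."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    # Each stream sends a message to the other through a projection.
    msg_to_mot = np.tanh(w_a2m @ app)   # appearance -> motion
    msg_to_app = np.tanh(w_m2a @ mot)   # motion -> appearance
    # Gates decide how much of the incoming message each stream keeps.
    app_new = app + sigmoid(msg_to_app) * msg_to_app
    mot_new = mot + sigmoid(msg_to_mot) * msg_to_mot
    # Fused representation: concatenation of the updated streams.
    return np.concatenate([app_new, mot_new])

d = 8
app, mot = rng.normal(size=d), rng.normal(size=d)
w_a2m = rng.normal(size=(d, d)) * 0.1
w_m2a = rng.normal(size=(d, d)) * 0.1
fused = message_passing_fusion(app, mot, w_a2m, w_m2a)
```

In the thesis the projections are trained end to end with the competing feature-fusion loss; here they are random placeholders to keep the sketch self-contained.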
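A single step of the memory-augmented recurrent model with a discrete write controller (contribution 3) could be sketched as below; the salience gate, cosine addressing, and the simplified state update are illustrative choices, not the thesis's exact design:

```python
import numpy as np

def memory_augmented_step(h, x, memory, threshold=0.5):
    """One recurrent step with an external memory module (sketch).

    The hidden state is updated from the input, the memory is read by
    similarity-weighted addressing, and a discrete controller writes
    the new state into its closest slot only when a salience gate
    exceeds `threshold` -- irrelevant steps leave the memory untouched.
    """
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))
    # Simplified recurrent update (learned weights omitted for brevity).
    h_new = np.tanh(h + x)
    # Content-based read: similarity-weighted sum over memory slots.
    sims = memory @ h_new / (np.linalg.norm(memory, axis=1)
                             * np.linalg.norm(h_new) + 1e-8)
    attn = np.exp(sims) / np.exp(sims).sum()
    h_new = h_new + attn @ memory
    # Discrete write controller: keep only evidently salient states.
    salience = sigmoid(np.linalg.norm(x) - 1.0)
    if salience > threshold:
        memory[int(np.argmax(sims))] = h_new
    return h_new, memory
```

A weak (low-norm) input leaves the memory unchanged, while a salient one overwrites its best-matching slot.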
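Contribution 4 directly optimizes the segmental edit score. The metric itself, standard in action segmentation, collapses frame-wise labels into segment sequences and takes one minus the normalized Levenshtein distance; a plain implementation:

```python
def segmental_edit_score(pred, gt):
    """Segmental edit score between frame-wise label sequences."""
    def segments(labels):
        # Collapse consecutive repeats into a segment-label sequence.
        segs = []
        for l in labels:
            if not segs or segs[-1] != l:
                segs.append(l)
        return segs

    p, g = segments(pred), segments(gt)
    # Standard dynamic-programming edit (Levenshtein) distance.
    m, n = len(p), len(g)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return 1.0 - d[m][n] / max(m, n, 1)
```

Because the score is computed on discrete segment sequences it is non-differentiable, which is why the thesis resorts to a policy gradient method to optimize it.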
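The temporal residual generator of contribution 5 amounts to learning f_full ≈ f_partial + G(f_partial). A minimal sketch with a hypothetical two-layer G; the adversarial discriminator and perceptual classifier that train G in the thesis are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def enhance_partial_features(f_partial, w1, w2):
    """Temporal residual generator (sketch): predict a residual that
    pushes the feature of a partially observed video toward the
    feature of the complete video."""
    hidden = np.maximum(0.0, w1 @ f_partial)  # ReLU hidden layer
    residual = w2 @ hidden                    # predicted residual
    return f_partial + residual

d, h = 16, 32
w1 = rng.normal(size=(h, d)) * 0.1            # placeholder weights
w2 = rng.normal(size=(d, h)) * 0.1
f_partial = rng.normal(size=d)
f_enhanced = enhance_partial_features(f_partial, w1, w2)
```

The residual form keeps the observed evidence intact and only adds the semantic information the generator believes is missing.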
Keywords/Search Tags:Human action recognition and prediction, Cross-modal feature fusion, Temporal dynamic modeling, Temporal structure exploring, Generative adversarial learning