
Human Action Localization And Recognition In Complex Videos

Posted on: 2019-09-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Song
GTID: 1488306470493564
Subject: Computer Science and Technology
Abstract/Summary:
Automatically analyzing and understanding video content is a highly active research area in computer vision and pattern recognition. It has many applications, ranging from video retrieval and video surveillance to smart health care and human-computer interaction. Human action localization and recognition is an important part of automatic video analysis. This work focuses on human action localization and recognition in complex videos, which is challenging because of camera movement, multi-object motion, cluttered backgrounds, and dynamic textures. This thesis presents the analysis of key segments, model design, and video representation for action localization and recognition, as well as abnormal event detection in complex videos.

Firstly, a novel key segment-based method is proposed for event detection in complex videos. An adaptive latent structural SVM model is employed to locate the key segments that contain human actions, objects, and scenes, with the locations of the key segments treated as latent variables. The key segments are extracted automatically by transferring the knowledge learned from Web images and Web videos to consumer videos, and by modeling both the temporal information among video segments and the relationships between segments and complex events. To alleviate the time-consuming and labor-intensive manual annotation of huge amounts of training videos, a large number of loosely labeled Web images and videos are used together with a limited number of complex consumer videos. A set of semantic concepts that describe actions is learned automatically from the tags of the images and the descriptions of the videos. The experiments show that the method improves event detection performance.

Secondly, a framework for temporal action localization and recognition in long untrimmed videos, called "Action Pattern Tree", is presented. Action pattern trees produce the temporal boundaries of actions and exploit the temporal information between video segments, based on the label vectors of the segments, by learning the occurrence frequency and order of segments. To obtain the labels of the video segments, deep neural networks based on 3D ConvNets are introduced to annotate the segments by simultaneously using their spatio-temporal information and high-level semantic features. The experimental results on temporal action localization datasets demonstrate that action pattern trees can effectively localize and recognize actions in videos.

Thirdly, a deep network named Dual Attention Neural Network (DANN) is proposed to solve the problem of temporal action localization and recognition in long untrimmed videos. The proposed DANN consists of two modules: a multi-feature fusion module and a video segment analysis module. The feature fusion module dynamically combines the static information and the spatio-temporal information of the video segments to produce informative feature vectors. In the segment analysis module, each segment is assigned a weight that represents its contribution to temporally localizing actions, and the weight is computed automatically by concatenating two attention layers. The final video representation for temporal action localization is generated by the network. When the temporal boundaries of actions are determined, segments with high weights are kept and segments with low weights are eliminated. The experiments on the THUMOS2014, MSR Action II, and MPII Cooking datasets show that the dual attention neural network can handle long videos of arbitrary temporal length and performs better than other methods on temporal action localization.

Finally, an adversarial attention-based autoencoder network called "Ada-Net" is presented to detect abnormal events in complex videos. The goal of abnormal event detection is to recognize abnormal actions or objects in videos and to localize the positions of the anomalies. The network is an end-to-end trainable unsupervised network designed to discover normal motion patterns in videos. Abnormal events are detected by computing the reconstruction errors between the original frames and the reconstructed frames. In the encoder, spatial convolutional layers and a stack of convolutional LSTMs produce the encoding feature maps by capturing both the spatial structure within frames and the temporal relationships between sequential frames. Attention-based convolutional LSTMs and deconvolutional layers decode the encoding feature maps to reconstruct the original frames. The attention mechanism dynamically selects the important information from the encoding feature maps and the last decoding hidden states for decoding. Instead of relying only on the simple Euclidean distance between the original frames and the reconstructed frames, a generative adversarial network is used as an effective regularization to guide the reconstruction. The experiments show that Ada-Net can accurately detect abnormal events in complex videos.
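The detection step described above, scoring frames by their reconstruction error, can be illustrated with a minimal Python sketch. It is not the Ada-Net implementation: the autoencoder and its adversarial training are not shown, and the reconstructed frames are assumed to be given. The function names, the squared-error measure, the min-max normalization into a "regularity score", and the 0.5 threshold are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (illustrative type only).
Frame = Sequence[Sequence[float]]


def reconstruction_error(original: Frame, reconstructed: Frame) -> float:
    """Sum of squared pixel differences between a frame and its reconstruction."""
    return sum(
        (o - r) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for o, r in zip(row_o, row_r)
    )


def regularity_scores(errors: Sequence[float]) -> List[float]:
    """Min-max normalize per-frame errors over one video.

    A score of 1.0 marks the best-reconstructed (most regular) frame and
    0.0 the worst-reconstructed one. This normalization is an assumption,
    not something stated in the abstract.
    """
    lo, hi = min(errors), max(errors)
    if hi == lo:
        # All frames reconstructed equally well: nothing stands out.
        return [1.0] * len(errors)
    return [1.0 - (e - lo) / (hi - lo) for e in errors]


def flag_abnormal(errors: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return indices of frames whose regularity score falls below the threshold.

    The default threshold is a hypothetical value chosen for the example.
    """
    scores = regularity_scores(errors)
    return [i for i, s in enumerate(scores) if s < threshold]


if __name__ == "__main__":
    # Toy per-frame errors: the third frame is reconstructed poorly.
    errors = [1.0, 1.0, 5.0, 1.0]
    print(regularity_scores(errors))
    print(flag_abnormal(errors))
```

Because the network is trained only to reproduce normal motion patterns, frames it reconstructs poorly are the candidates for abnormal events; in practice the threshold would be tuned on held-out data.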
Keywords/Search Tags:human action localization, human action recognition, complex videos, abnormal event detection, dual attention network, action pattern tree, adversarial autoencoder network