
Structured Representation Model Based Human Action Recognition And Motion Segmentation

Posted on: 2018-01-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Chen
Full Text: PDF
GTID: 1318330566967396
Subject: Mechanical engineering
Abstract/Summary:
Human action recognition and motion segmentation are essential to many artificial intelligence systems and are widely used in fields such as intelligent video surveillance, human-computer interaction, video analysis and retrieval, and robotics; they have become hot research topics in both academia and industry. RGB-based human action recognition has recently been studied extensively and has achieved impressive results on benchmark datasets, but many challenges remain, such as viewpoint changes, significant illumination variation, partial occlusion, and pose ambiguity. Appearance and motion information extracted from the single RGB modality is therefore not sufficient to characterize actions effectively. Recent advances in depth cameras and sensors, especially the Kinect sensor, provide new ways to obtain 3D human action data: the Kinect captures two modalities, RGB and depth (including skeleton joint positions), whose combination undoubtedly benefits action recognition. Depth-based action recognition has consequently become a hot topic in the computer vision community.

Motion segmentation is another classic problem in computer vision. Under the affine camera model, motion segmentation from tracked feature points can be formulated as a subspace clustering problem in which each subspace corresponds to a different motion. The key issue for subspace clustering based motion segmentation is the construction of an affinity matrix with a rigorous block-diagonal structure.

To address these two problems, this dissertation investigates RGB-D based action recognition and RGB based motion segmentation. The main contributions are as follows:

(1) To extract features from human action data in the depth modality, we propose a novel Joint Local Surface Geometric Feature (JLSGF) based on skeleton joints and depth information, which jointly captures the geometric appearance and pose information of an action. A covariance descriptor then models the temporal evolution of the action within a constructed temporal pyramid, depicting its characteristics in the spatio-temporal domain.

(2) To take full advantage of RGB and depth information, three descriptors, namely Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF), and Motion Boundary Histograms (MBH), are first extracted from the RGB modality to encode the motion and appearance cues of dense trajectories drawn from the action. A novel two-stage multi-modality fusion framework then combines these with the features extracted from the depth modality. The framework exploits the complementary nature of depth and RGB information to perform feature-level and modality-level fusion, comprehensively using motion cues, visual and geometric appearance, and trajectory shape.

(3) To eliminate correlation between the encoding coefficients of different classes while simultaneously promoting coherence within each class, a novel dictionary learning model is proposed that couples structured sparse representation with low-dimensional embedding, and an optimization algorithm is designed to solve it. The learned low-dimensional projection matrix enhances the representational power of the dictionary and thereby improves the robustness of the sparse representation model.

(4) Different types of features have distinct discriminative abilities for different actions, and combining them with equal weights may weaken the strongly discriminative ones. A structured multi-view feature learning model is therefore proposed to fuse features at both the view level and the individual-feature level. Under this model, most features in discriminative views and a small number of features in non-discriminative views learn large weights as the important, discriminative features. An optimization algorithm is also developed to efficiently solve the formulated convex optimization problem, with a theoretical guarantee of finding the global optimum.

(5) For motion segmentation based on subspace clustering, the affinity matrices constructed by traditional methods lack a clear block-diagonal structure. We propose a Laplacian structured representation model that enhances representation-based clustering methods by incorporating local feature-similarity priors to guide the encoding process, and we develop an efficient Alternating Direction Method of Multipliers (ADMM) algorithm for its optimization. Two improved subspace clustering methods, Enhanced Sparse Subspace Clustering (E-SSC) and Enhanced Low-Rank Representation (E-LRR), are devised under this framework.

(6) Comprehensive comparative experiments on six widely used public action recognition benchmark datasets achieve competitive results and validate the effectiveness of the proposed methods. For motion segmentation, the proposed framework improves results on two public benchmark datasets, outperforming classic alternative approaches by a notable margin.
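The temporal-pyramid covariance modelling in contribution (1) can be sketched in a few lines. This is a generic illustration, not the dissertation's JLSGF pipeline: the per-frame feature dimension, the two pyramid levels, and the random toy sequence are all assumptions.

```python
import numpy as np

def covariance_descriptor(feats):
    """Covariance of frame-wise features, vectorized as the upper triangle."""
    C = np.cov(feats, rowvar=False)           # d x d covariance matrix
    iu = np.triu_indices(C.shape[0])
    return C[iu]                              # d*(d+1)/2 values

def temporal_pyramid_descriptor(feats, levels=2):
    """Concatenate covariance descriptors over a temporal pyramid:
    level 0 covers the whole sequence, level 1 its two halves, etc."""
    T = feats.shape[0]
    parts = []
    for lvl in range(levels):
        n_seg = 2 ** lvl
        bounds = np.linspace(0, T, n_seg + 1, dtype=int)
        for a, b in zip(bounds[:-1], bounds[1:]):
            parts.append(covariance_descriptor(feats[a:b]))
    return np.concatenate(parts)

# toy sequence: 40 frames of hypothetical 6-D per-frame features
rng = np.random.default_rng(0)
seq = rng.normal(size=(40, 6))
desc = temporal_pyramid_descriptor(seq, levels=2)
print(desc.shape)  # (63,) = 3 segments x 21 upper-triangle entries
```

Because the covariance is computed per segment, the descriptor captures how feature co-variation evolves over time while staying fixed-length regardless of sequence duration.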
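Contribution (5) builds on representation-based subspace clustering, where each point is encoded as a combination of the other points and the symmetrized coefficient magnitudes form the affinity matrix. The sketch below substitutes a plain ridge-regularized encoder for the dissertation's Laplacian structured model, and checks that the resulting affinity is approximately block-diagonal on toy two-subspace data; the ambient dimension, subspace dimensions, and regularization weight are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def subspace_points(n, ambient=6, dim=2):
    """Sample n points from a random dim-dimensional subspace of R^ambient."""
    basis = np.linalg.qr(rng.normal(size=(ambient, dim)))[0]
    return basis @ rng.normal(size=(dim, n))

# two independent 2-D subspaces, 15 points each (columns of X)
X = np.hstack([subspace_points(15), subspace_points(15)])

# self-representation: x_i ~ X c_i with c_ii = 0, solved by
# ridge-regularized least squares (stand-in for the structured model)
N = X.shape[1]
lam = 0.1
C = np.zeros((N, N))
for i in range(N):
    Xi = np.delete(X, i, axis=1)
    ci = np.linalg.solve(Xi.T @ Xi + lam * np.eye(N - 1), Xi.T @ X[:, i])
    C[np.arange(N) != i, i] = ci

W = np.abs(C) + np.abs(C.T)   # symmetric affinity matrix

# a usable affinity is approximately block-diagonal: within-subspace
# weights should dominate across-subspace weights
within = W[:15, :15].sum() + W[15:, 15:].sum()
across = 2 * W[:15, 15:].sum()
print(within > across)
```

Spectral clustering of `W` would then recover the two motions; the dissertation's Laplacian prior sharpens exactly this block-diagonal structure before that step.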
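The optimizer in contribution (5) is ADMM. As a generic illustration of the alternating-update scheme, and not the dissertation's specific variable splitting, here is textbook ADMM for the lasso problem: the x-update is a least-squares solve, the z-update a soft-thresholding step, and u a dual ascent step. Problem sizes and penalty parameters are assumptions.

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of k*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """ADMM for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    m, n = A.shape
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA = A.T @ A + rho * np.eye(n)   # factorizable once, reused each iteration
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(AtA, Atb + rho * (z - u))  # smooth subproblem
        z = soft_threshold(x + u, lam / rho)           # sparsity subproblem
        u = u + x - z                                  # dual update
    return z

# toy sparse recovery: 20-D signal with support {3, 7}
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20); x_true[[3, 7]] = [1.5, -2.0]
b = A @ x_true + 0.01 * rng.normal(size=50)
x_hat = admm_lasso(A, b, lam=0.5)
print(np.flatnonzero(np.abs(x_hat) > 0.5))  # recovers the support: [3 7]
```

The same pattern, alternating a smooth solve with a proximal step and a dual update, underlies the E-SSC and E-LRR optimizers, with the representation matrix and Laplacian term taking the place of the lasso pieces.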
Keywords/Search Tags: Action recognition, Depth camera, Motion segmentation, Structured representation