| Human action recognition is widely used in fields such as video understanding, video surveillance, and human-computer interaction, and it is a popular research direction in computer vision. Compared with traditional RGB image or video data, human skeleton data are compact, invariant to background, and rich in information, so skeleton-based action recognition has drawn broad attention. However, most existing methods are limited to supervised learning, and training them requires a large amount of labeled data. Annotating training data is tedious and expensive; at the same time, similar actions are often mislabeled, and recognition is slow. In addition, the set of labeled action categories is fixed, so scalability is poor. How to extract action representations with effective unsupervised methods, and how to easily expand the set of recognizable actions, therefore remain urgent problems. On the algorithm side, this paper proposes an action representation learning network based on skeleton sequences. Using asymmetric spatial and temporal augmentations, the method combines contrastive learning and pretext-task learning in one framework, and training is carried out in a purely self-supervised manner. Even without any labeled data, the network can extract discriminative representations that carry spatiotemporal information. The method uses graph convolution as the backbone to exploit the natural spatial structure of the skeleton. Several skeleton-specific spatial and temporal augmentations are also introduced for generating positive pairs; they encourage the model to focus on the spatiotemporal information of skeleton-based action sequences while ignoring confounding factors such as viewpoint and exact joint positions. Once training is complete, the learned representation extraction network is used in the representation extraction stage to extract action features from the raw skeleton sequence. Experimental results show that the proposed skeleton sequence augmentations, the unsupervised training method combining contrastive and pretext learning, and the spatiotemporal encoder based on the spatial-temporal graph convolutional network achieve advanced results in both the unsupervised and semi-supervised settings on the public datasets NTU-RGB+D 60, NTU-RGB+D 120, and Northwestern-UCLA. On the engineering side, this paper designs a math-based skeleton representation extraction system and uses gesture recognition, which is harder than human action recognition, as a case study. A set of coding rules is designed to extract a representation of the hand, and gestures are recognized by matching against predefined gesture codes, which achieves real-time and robust gesture recognition. The gesture database can also be modified in real time to expand the set of recognizable actions. |
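
The positive-pair idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a skeleton sequence stored as a list of frames, each a list of (x, y, z) joints. It reads "asymmetric" as giving each branch a different augmentation, and it uses an InfoNCE-style loss as a stand-in for the contrastive objective, which the abstract does not specify. The function names (`rotate_y`, `temporal_crop`, `make_positive_pair`, `info_nce`), the rotation range, the crop ratio, and the temperature are all hypothetical choices for illustration.

```python
import math
import random


def rotate_y(sequence, angle):
    """Spatial augmentation: rotate every joint about the vertical (y) axis.

    This simulates a viewpoint change, one of the confounding factors
    the representation should learn to ignore.
    """
    c, s = math.cos(angle), math.sin(angle)
    return [[(c * x + s * z, y, -s * x + c * z) for (x, y, z) in frame]
            for frame in sequence]


def temporal_crop(sequence, ratio, rng):
    """Temporal augmentation: keep a random contiguous window of frames."""
    length = max(1, int(len(sequence) * ratio))
    start = rng.randint(0, len(sequence) - length)
    return sequence[start:start + length]


def make_positive_pair(sequence, rng):
    """Build two views of one sequence (assumed asymmetric scheme):
    one branch is augmented spatially, the other temporally."""
    view_a = rotate_y(sequence, rng.uniform(-math.pi / 6, math.pi / 6))
    view_b = temporal_crop(sequence, 0.5, rng)
    return view_a, view_b


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding.

    The loss is small when the anchor is closer to its positive than to
    the negatives. Embeddings would come from the encoder, which is not
    modeled here.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sequence: 10 identical frames with 2 joints each.
    toy = [[(1.0, 2.0, 0.0), (0.0, 1.0, 1.0)] for _ in range(10)]
    view_a, view_b = make_positive_pair(toy, rng)
    print(len(view_a), len(view_b))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In a full pipeline the two views would be passed through the graph-convolution encoder, and the resulting embeddings, rather than hand-written vectors, would be fed to the loss.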