Human action recognition has broad application value in fields such as intelligent transportation, intelligent security, smart homes, human-computer interaction, and VR/AR. Because human actions are diverse and many involve interaction with objects in the environment, action recognition remains a challenging problem in computer vision: it is highly complex, easily disturbed, and strongly affected by environmental factors. To address insufficient feature extraction and the difficulty of capturing global information in graph convolutional networks (GCNs), we propose a GCN-based action recognition model that integrates a spatio-temporal self-attention mechanism. We design a novel spatio-temporal self-attention block that decouples the temporal and spatial dimensions of the input data and extracts features from each. This block is embedded between the convolutional layers of an adaptive GCN, forming the integrated spatio-temporal self-attention mechanism. On this basis, we build a two-stream GCN-based action recognition model that takes human keypoint data and bone data as its input streams. The model achieves good recognition accuracy on both the NTU-RGB+D and Kinetics-400 datasets, outperforming several classic models. Through multiple sets of experiments, we detail the contribution of each component to the overall performance and explain the sources of the improvement. The results show that the proposed design is sound and that the model reaches the expected level of performance. Finally, we develop a human action recognition application to test the model and demonstrate its usability. The application supports two modes, real-time recognition and local-video recognition, taking input from a connected camera or a local video file, respectively. It classifies actions in real time, or after reading the entire video, and displays the original video, the video with keypoints marked by the human pose estimation algorithm, and the classification results on the page, covering the full pipeline from keypoint extraction through action recognition to final result display.
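The decoupling of the temporal and spatial dimensions described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation; it is a single-head scaled dot-product attention applied first over joints within each frame (spatial) and then over frames for each joint (temporal), with all weight matrices, shapes, and the `decoupled_st_attention` helper being illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # x: (N, C) tokens; single-head scaled dot-product attention.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def decoupled_st_attention(x, params_s, params_t):
    # x: (T, V, C) -- T frames, V joints, C channels.
    # Spatial attention: joints attend to each other within a frame.
    xs = np.stack([self_attention(x[t], *params_s) for t in range(x.shape[0])])
    # Temporal attention: each joint's trajectory attends across frames.
    xt = np.stack([self_attention(xs[:, v], *params_t)
                   for v in range(x.shape[1])], axis=1)
    return xt

T, V, C = 4, 25, 8  # e.g. 25 joints as in NTU-RGB+D skeletons
rng = np.random.default_rng(0)
x = rng.standard_normal((T, V, C))
params_s = tuple(rng.standard_normal((C, C)) for _ in range(3))
params_t = tuple(rng.standard_normal((C, C)) for _ in range(3))
out = decoupled_st_attention(x, params_s, params_t)
print(out.shape)  # (4, 25, 8)
```

Because each attention pass mixes information along only one axis, the block can capture global dependencies in both dimensions at a cost far below full spatio-temporal attention over all T*V tokens.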
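The two input streams mentioned above can be sketched as follows. In common two-stream skeleton pipelines, the bone stream is derived from the keypoint stream as the vector from each joint to its parent, and the per-class scores of the two streams are fused by summation; the toy 5-joint `PARENTS` table and the helper names here are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

# Hypothetical parent index for a 5-joint toy skeleton (the root is its own parent).
PARENTS = [0, 0, 1, 2, 3]

def joints_to_bones(joints, parents=PARENTS):
    # joints: (T, V, 3) keypoint coordinates; bone v = joint v minus its parent joint.
    return joints - joints[:, parents, :]

def fuse_two_stream(scores_joint, scores_bone):
    # Late fusion: sum the per-class scores predicted by the two streams.
    return scores_joint + scores_bone

T, V = 2, 5
rng = np.random.default_rng(1)
joints = rng.standard_normal((T, V, 3))
bones = joints_to_bones(joints)
print(bones.shape)  # (2, 5, 3); the root bone is the zero vector
```

Joint coordinates carry absolute position while bone vectors carry direction and length, so the two streams provide complementary evidence for classification.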