| Human action recognition aims to identify the action category of people in a video. It has a wide range of application prospects, such as railway station monitoring, intelligent medical robots, and analysis of cheating behavior in examinations. Traditional action recognition methods require manually extracting video features for classification, which involves a heavy workload and yields low accuracy. Deep learning methods can extract video features automatically and achieve high accuracy. However, deep learning based action recognition has two main problems. First, existing methods cannot effectively use the key spatiotemporal information in the video, and the extracted features contain a lot of redundant spatiotemporal information. Second, existing methods lack reasoning over the key spatiotemporal information in interactive actions, so there is still room to improve the recognition rate. Therefore, building on convolutional neural networks, this thesis proposes two action recognition methods to improve recognition performance. The first is an action recognition method based on a two-stream network with a spatiotemporal attention mechanism. First, a channel attention mechanism is introduced into the two-stream base network, and the channel information is recalibrated by modeling the dependencies between feature channels to improve the feature expression ability. Second, a CNN-based temporal attention model is proposed to learn the attention score of each frame with fewer parameters, so that the network can focus on frames with significant motion amplitude. At the same time, a multi-spatial attention model is proposed, which calculates the attention score of each position in a frame from different angles to extract motion-salient regions. Then, the temporal and spatial features are fused to further enhance the feature representation of the video. Finally, the fused features are fed into the classification network, and the results of each stream are fused with different weights to obtain the recognition result. The second method addresses the fact that action recognition videos contain many interactive actions, for which recognition performance can be further improved by relationship reasoning between interacting objects and between different frames. This thesis proposes an action recognition method based on a two-stream heterogeneous spatiotemporal relationship network built on graph convolutional networks. In this method, different network structures are used to extract features in the appearance stream and the motion stream, respectively, to obtain richer video information. To obtain information about the multiple objects related to an action, a channel grouping attention network is proposed, which clusters the regions of each frame according to channel information. The different objects in each frame are then regarded as nodes in a graph. After the adjacency relationship is defined, the relationships between different objects are modeled by a graph convolutional network, and the relationships between different frames of the video are then inferred by a graph convolutional network to model the temporal relationships of the video, thereby improving the accuracy of action recognition. Finally, the two methods are evaluated on the HMDB51 and UCF101 datasets. The experimental results show that the proposed method based on the two-stream network with a spatiotemporal attention mechanism makes full use of the key spatiotemporal information in the video and recognizes actions more effectively. The method based on the two-stream heterogeneous spatiotemporal relationship network with graph convolution effectively models the action-related object information in the video, mines the relationships between different frames, effectively identifies the interactive actions in the datasets, and improves the accuracy of action recognition. |
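
As a rough illustration of two steps named in the abstract, the sketch below shows how per-frame attention scores could weight frame features, and how the class scores of the appearance and motion streams could be fused with different weights. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the softmax normalization of the scores, and the example stream weights are all hypothetical, and the attention scores are taken as given inputs rather than learned by a CNN.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, frame_scores):
    """Combine per-frame feature vectors into one video-level vector.

    frame_features: list of T feature vectors, each of length D.
    frame_scores:   list of T raw attention scores (assumed to come
                    from some learned temporal attention model).
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]


def fuse_streams(appearance_probs, motion_probs,
                 w_appearance=1.0, w_motion=1.5):
    """Weighted late fusion of the two streams' class probabilities.

    The default weights are placeholders chosen for illustration only.
    """
    total = w_appearance + w_motion
    return [
        (w_appearance * a + w_motion * m) / total
        for a, m in zip(appearance_probs, motion_probs)
    ]


if __name__ == "__main__":
    # Two frames with 2-dimensional features; the second frame scores higher.
    video_feature = temporal_attention_pool(
        [[1.0, 0.0], [3.0, 2.0]], [0.0, 2.0]
    )
    print(video_feature)

    # Two-class example: the motion stream favors class 1.
    fused = fuse_streams([0.6, 0.4], [0.2, 0.8])
    print(fused.index(max(fused)))
```

Frames with higher attention scores contribute more to the pooled feature, which matches the stated goal of focusing on frames with significant motion. In practice the fusion weights would be chosen on validation data rather than fixed in advance.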