| Human action recognition is an important research task in computer vision, with a wide range of applications in video surveillance, somatosensory games, action analysis, and video search. In recent years, deep learning, represented by convolutional neural networks, has made breakthroughs in human action recognition. At the same time, the attention mechanism has attracted researchers' interest and has achieved good results in areas such as natural language processing. To further improve the accuracy of human action recognition, this paper draws on the attention mechanism of the human brain and combines it with a convolutional neural network to propose two attention-based methods for human action recognition: one based on an attention model and the other based on a multi-layer attention model. To assign more attention to the key joints in skeleton sequences, this paper proposes a method based on an attention model. Its model consists of two parallel sub-networks: a deep convolutional sub-network and an attention sub-network. The deep convolutional sub-network extracts features from the skeleton sequence and maps them to the label space to obtain a prediction vector. The attention sub-network extracts a motion saliency vector for each joint of the human body from its motion trajectory and maps these vectors to the label space to obtain another prediction vector. Finally, the two prediction vectors are fused to produce the final recognition result. To address the lack of communication between the deep convolutional sub-network and the attention sub-network in the first method, this paper proposes a method based on a multi-layer attention model. This model consists of two interacting sub-networks: a deep convolutional sub-network and a multi-layer attention sub-network. The deep convolutional sub-network extracts features from the skeleton sequence and outputs multiple deep feature maps at different scales. Simultaneously, the multi-layer attention sub-network applies transposed convolutions to the motion saliency vectors to obtain the corresponding attention weight maps. Finally, the deep feature maps and attention weight maps are fused through residual connections to produce the final recognition result. The two proposed methods have been verified on three international human action datasets: SYSU-3D, UTD-MHAD, and NTU RGB+D. The method based on an attention model achieves recognition accuracies of 80.36% on SYSU-3D, 94.37% on UTD-MHAD, and 86.18% and 92.21% on NTU RGB+D. The method based on a multi-layer attention model improves on the first method by 2.97%, 2.41%, and 0.53% and 0.46%, respectively. Experimental results show that the two proposed methods substantially improve the accuracy of human action recognition, surpassing most state-of-the-art methods. |
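
The two fusion steps described above can be illustrated with a minimal sketch. The abstract does not state the exact fusion operators, so both are assumptions here: the two prediction vectors are fused by a weighted average of their softmax outputs, and the residual connection takes the common form `feature * (1 + weight)`. The function names `fuse_predictions` and `residual_attention` and the `weight` parameter are hypothetical and introduced only for illustration.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(conv_scores, attn_scores, weight=0.5):
    """Late fusion of two prediction vectors.

    Assumption: a weighted average of the softmax outputs of the
    convolutional and attention sub-networks. Returns the predicted
    label index and the fused probability vector.
    """
    p_conv = softmax(conv_scores)
    p_attn = softmax(attn_scores)
    fused = [weight * a + (1.0 - weight) * b for a, b in zip(p_conv, p_attn)]
    return fused.index(max(fused)), fused


def residual_attention(feature_map, weight_map):
    """Residual fusion of a feature map with an attention weight map.

    Assumption: out = feature * (1 + weight), so a zero attention
    weight leaves the original feature unchanged.
    """
    return [
        [f * (1.0 + w) for f, w in zip(f_row, w_row)]
        for f_row, w_row in zip(feature_map, weight_map)
    ]


if __name__ == "__main__":
    label, probs = fuse_predictions([2.0, 0.5, 0.1], [0.3, 1.8, 0.2])
    print(label, probs)
    print(residual_attention([[1.0, 2.0]], [[0.0, 0.5]]))
```

In the residual form, the attention weights can only re-scale features relative to an identity path, which is one plausible reason to prefer it over plain multiplication: features at positions with low attention are preserved rather than suppressed to zero.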