| Action recognition is one of the most important fields of computer vision. It has broad application prospects in areas such as security monitoring, abnormal event detection, video classification, and human-computer interaction. With the explosive growth of video data on the Internet, understanding and analyzing video effectively for various purposes has become very important. Traditional hand-crafted feature extraction methods have many limitations when dealing with massive video data, and the analysis of large-scale video data remains an open problem. With the development of computer hardware and the rapid rise of deep learning, deep learning methods, especially convolutional neural networks, have been applied to various computer vision tasks and have achieved a series of remarkable results. Although convolutional neural networks currently perform very well in tasks such as image classification and detection, their performance in action recognition is only mediocre. Because an action in a video is a three-dimensional spatio-temporal signal, recognizing it is more complex than two-dimensional image recognition. It is therefore of great significance to make convolutional neural networks more efficient and accurate for human action recognition. This paper analyzes the differences among several feature extraction methods used in action recognition and focuses on feature extraction based on deep learning. A traditional 2D convolution kernel processes only a single frame and therefore lacks temporal information, so 3D convolution with spatio-temporal three-dimensional filters seems like a natural approach to video modeling. Although 3D convolutional neural networks are more suitable for video analysis than 2D convolutional neural networks, they have many problems in practical applications. One reason is that video datasets are relatively small compared with the large-scale image datasets available to 2D convolutional neural networks. What is worse, the immense number of parameters may cause overfitting. This paper proposes an improved 3D residual neural network that incorporates the SE block for human action recognition. The residual network mitigates the model degradation caused by excessive network depth through residual learning and identity shortcut connections, and can reduce the number of parameters. The SENet structure is introduced to improve the quality of the representations produced by the network by explicitly modeling the interdependencies between the channels of the convolutional features. We conducted experiments on the UCF-101 and HMDB-51 action datasets. For each video sample from the two datasets, a 16-frame clip is generated around a selected temporal position as input, and the datasets are augmented by random cropping. The action classifiers are obtained through end-to-end training. The experimental results show that the improved 3D residual neural network effectively improves recognition accuracy. Finally, with a pre-trained model, the recognition results on UCF-101 and HMDB-51 are better than those of other methods that do not use optical flow, which verifies the effectiveness of the proposed human action recognition algorithm based on the 3D residual neural network. |
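
To make the channel recalibration idea concrete, the following is a minimal sketch of a squeeze-and-excitation step applied to the residual branch of a block, written in plain Python so that it runs without any deep learning library. It is an illustration under stated assumptions and not the network used in the paper: the placement of the SE step on the residual branch before the shortcut addition, the reduction ratio of 2, the toy tensor sizes, the random weights, and all function names (`squeeze`, `excite`, `se_residual`) are hypothetical choices made for the example. Biases and the 3D convolutions themselves are omitted.

```python
import math
import random


def squeeze(features):
    """Global average pooling: one scalar per channel.

    `features` is a list of C channels; each channel is a flat list holding
    the T*H*W activations of that channel.
    """
    return [sum(channel) / len(channel) for channel in features]


def excite(z, w1, w2):
    """Two fully connected layers: a ReLU bottleneck followed by sigmoid gates."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


def se_residual(x, fx, w1, w2):
    """Rescale the residual branch fx channel by channel, then add the shortcut x."""
    gates = excite(squeeze(fx), w1, w2)
    return [[xi + g * fi for xi, fi in zip(x_ch, f_ch)]
            for x_ch, f_ch, g in zip(x, fx, gates)]


# Toy example (assumed sizes): C = 4 channels, T*H*W = 2*2*2 = 8 activations each,
# reduction ratio r = 2, so the bottleneck has C / r = 2 units.
random.seed(0)
C, N, R = 4, 8, 2
x = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(C)]
# fx stands in for the output of the block's 3D convolutions (not computed here).
fx = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]  # C -> C/r
w2 = [[random.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]  # C/r -> C
y = se_residual(x, fx, w1, w2)
```

Each gate lies strictly between 0 and 1, so the step can only attenuate a channel of the residual branch relative to the others, while the identity shortcut passes `x` through unchanged. In a real model the weights would be learned end to end together with the convolutions.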