| Action recognition is the task of understanding and classifying human behavior from videos or image sequences; this thesis applies deep learning to it. Deep learning has become a very popular area of machine learning in recent years. The convolutional neural network (CNN), a representative deep learning model, achieves better recognition performance than traditional neural networks and is an end-to-end method that requires no manually designed features. It has attracted wide research interest and has succeeded in several areas of computer vision. It offers a degree of translation and scale invariance, and its computation has many similarities with the mammalian visual system. First, this thesis introduces the theoretical basis of convolutional neural networks. It reviews the traditional neural network, then presents the convolutional neural network and describes its convolution and pooling layers. It then introduces LeNet-5, a convolutional network structure for small databases, together with its experimental results on the MNIST database. Next, the thesis introduces a network for the large-scale ImageNet database, whose structure differs from LeNet-5 in several respects, including the ReLU nonlinear activation function, overlapping max pooling, and the softmax classifier. Finally, it briefly illustrates the use of convolutional neural networks on video. Second, this thesis introduces the 3D CNN structure used for video, which has two convolution layers, two pooling layers, a fully connected layer, and an output layer, and uses five channels in the input layer (one pixel channel, two gradient channels, and two optical flow channels). The thesis then describes the improved 3D CNN in detail. 
The improved 3D CNN has seven input channels (one pixel channel, four Gabor filter channels, and two optical flow channels), uses more kernels than the original 3D CNN, and downsamples in the time domain in its pooling layers. This thesis then presents a new network design that uses the Network in Network (NIN) technique, the temporal-spatial pyramid technique, the ReLU nonlinear activation function, and the softmax classifier. It first introduces the NIN technique, a nonlinear extension of linear convolution that can learn nonlinear features. It then introduces the temporal-spatial pyramid technique, which allows the network input to be videos with different resolutions and numbers of frames. It then describes the overall structure of the new network in detail and analyzes its advantages over the 3D CNN. Finally, the improved 3D CNN and the new network are evaluated, including experimental results on the KTH database, an analysis of the time and space complexity of the networks, and a visualization of the feature maps. The thesis closes by analyzing the advantages and disadvantages of the two networks and their scope of application. |