Video content understanding is an important research topic in computer vision. Within computer vision and machine learning, video understanding covers problems such as action recognition, action detection, event detection, video summarization, and anomaly detection. The results of this research can be applied directly in real-world settings. For instance, anomaly-detection techniques can be used for fare-evasion checking in subway stations, safety inspection at railway station entrances, and accident detection in traffic surveillance.

In this thesis we focus on action recognition, since other topics such as action detection and video summarization tend to build on its results. Action recognition is the task of classifying well-segmented video shots; in most action recognition benchmarks, each video contains only one person. Because of the limitations of traditional video analysis methods, we adopt deep learning in our video understanding algorithms. Deep learning has achieved great success in image recognition and detection: on the ImageNet validation set, the top-5 accuracy of deep networks has surpassed that of humans. Thanks to the powerful representational ability of deep neural networks, deep learning has also been applied to pose detection, saliency detection, and other tasks. In video processing, although deep learning has obtained some good results, most existing work simply transfers deep networks to video tasks without considering the spatio-temporal nature of video. This thesis discusses how to employ deep networks in video processing tasks, investigating how to extract more powerful features with deep networks and how to design a better network architecture for dynamic pattern recognition.

We propose two novel action recognition algorithms. First, we combine LCD (Latent Concept Descriptor) with two-stream CNNs and extend LCD to a multi-resolution version, mLCD (Multi-resolution Latent Concept Descriptor). The proposed approach encodes the last convolutional layer of the two-stream CNNs and classifies videos with an SVM. Second, we propose TCNN (Temporal Convolutional Neural Network) for action recognition. This algorithm extracts visual features for each video frame with two-stream CNNs and concatenates the resulting feature vectors into a large feature image, which a second CNN then classifies. Compared with LSTM, our model achieves better accuracy on action recognition benchmarks.

We conduct extensive experiments on public datasets including Hollywood2, Olympic Sports, and UCF101. The results show that our mLCD algorithm outperforms state-of-the-art algorithms on the Hollywood2 and Olympic Sports datasets. On the UCF101 benchmark, TCNN achieves results comparable to the state of the art, and the experimental results validate the effectiveness of the proposed algorithm.
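The descriptor-encoding step behind LCD can be sketched in a few lines of numpy. This is a minimal illustration under our own assumptions, not the thesis's implementation: we treat each spatial position of the last convolutional layer as a local descriptor and aggregate them with a VLAD-style encoding (a common choice for latent concept descriptors); the codebook here is a random stand-in for one learned by k-means, and all function names are ours.

```python
import numpy as np

def latent_concept_descriptors(conv_map):
    """Flatten an (H, W, C) conv activation map into H*W C-dim local descriptors."""
    h, w, c = conv_map.shape
    return conv_map.reshape(h * w, c)

def vlad_encode(descriptors, centers):
    """VLAD-style encoding: sum residuals of descriptors to their nearest center."""
    k, c = centers.shape
    # assign each descriptor to its nearest codebook center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)
    enc = np.zeros((k, c))
    for i, a in enumerate(assign):
        enc[a] += descriptors[i] - centers[a]
    enc = enc.ravel()
    # power normalization followed by L2 normalization, as is standard for VLAD
    enc = np.sign(enc) * np.sqrt(np.abs(enc))
    n = np.linalg.norm(enc)
    return enc / n if n > 0 else enc

# toy example: one 7x7x512 conv map, 64-center codebook (random stand-in)
rng = np.random.default_rng(0)
conv_map = rng.standard_normal((7, 7, 512))
centers = rng.standard_normal((64, 512))
desc = latent_concept_descriptors(conv_map)
video_feature = vlad_encode(desc, centers)
print(video_feature.shape)  # (32768,) = 64 * 512
```

The multi-resolution variant would repeat this encoding at several input scales and pool the resulting vectors; an SVM then classifies the pooled video representation.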
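The feature-image construction behind TCNN can likewise be illustrated with a small numpy sketch. This is only a schematic under our own assumptions (frame count, feature dimension, and function names are placeholders): per-frame feature vectors are stacked row by row into a T x D "image", and a second network then convolves over it; here a single hand-written temporal kernel stands in for that network's learned filters.

```python
import numpy as np

def build_feature_image(frame_features):
    """Stack per-frame feature vectors (one per row) into a T x D feature image
    that a second CNN can convolve over time (rows) and feature dims (columns)."""
    return np.stack(frame_features, axis=0)

def temporal_filter(feature_image, kernel):
    """Slide a 1-D kernel along the time axis (correlation form, 'valid' mode),
    shared across all feature dimensions -- a stand-in for one learned filter."""
    t, d = feature_image.shape
    k = len(kernel)
    out = np.zeros((t - k + 1, d))
    for i in range(t - k + 1):
        out[i] = (kernel[:, None] * feature_image[i:i + k]).sum(axis=0)
    return out

# toy example: 16 frames, each with a 256-dim two-stream feature vector
rng = np.random.default_rng(1)
frames = [rng.standard_normal(256) for _ in range(16)]
feature_image = build_feature_image(frames)
print(feature_image.shape)  # (16, 256)

# a simple temporal-difference kernel highlights changes between frames
response = temporal_filter(feature_image, np.array([1.0, 0.0, -1.0]))
print(response.shape)  # (14, 256)
```

Unlike an LSTM, which consumes frames sequentially, convolving over the feature image lets the classifier see local temporal patterns across all frames at once.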