| With the growing maturity of internet technologies, especially the spread of mobile internet applications and the popularity of smartphones, digital cameras, and surveillance cameras, video has become an indispensable medium in people’s daily work and life, and video services are developing rapidly. The ever-increasing number of videos, together with inappropriate video content, places unprecedented pressure on video storage, analysis, and supervision. Deep learning has shown great advantages in computer vision and has achieved results beyond the reach of traditional methods in application scenarios such as video description and fine-grained image recognition. This thesis therefore uses a series of commonly used deep learning network models, studies the spatial-temporal fusion characteristics of video, introduces an attention mechanism analogous to human vision, and improves the basic deep LSTM model in order to develop accurate and efficient video action recognition techniques. Unlike traditional action recognition techniques, deep learning has powerful feature extraction capabilities and can extract highly discriminative features adapted to the task. The work of this thesis on deep-learning-based video action recognition is summarized as follows. First, based on the spatial-temporal fusion characteristics of video, we extract the spatial and temporal features of a video separately and integrate them into spatial-temporal fusion features. Second, imitating the attention mechanism of the human visual system, we propose a spatial-temporal fusion model based on an attention mechanism. Operating on video segments, the model focuses on the key frames of each segment by assigning them greater weight, which reduces the interference of redundant information with action recognition. Third, to improve the performance of the basic deep LSTM network model, we propose a spatial-temporal fusion model based on fast-forward connections and a spatial-temporal fusion model based on temporal multi-scale. By optimizing information propagation in deep LSTM networks and exploiting temporal multi-scale video content, we further improve the recognition performance of the basic deep LSTM network model. Finally, we use TensorFlow to evaluate the three models on the UCF-101 and HMDB-51 datasets: the attention-based model, the fast-forward-connection model, and the temporal multi-scale model. The experimental results show that all three spatial-temporal fusion models proposed in this thesis improve the accuracy of video action recognition. We also analyze the recognition accuracy of the three models on specific categories of video content, and use two video segments for attention visualization analysis of the attention-based spatial-temporal fusion model. |
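
The key-frame weighting described in the abstract can be illustrated with a minimal sketch. It assumes that each frame of a segment already has a fused spatial-temporal feature vector and that frame relevance is scored by a dot product with a learned vector. The names `attention_pool` and `scoring_vector` are hypothetical, and this scoring function is an illustrative assumption, not necessarily the one used in the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, scoring_vector):
    """Weight each frame by a softmax-normalised score, then sum the frames.

    frame_features: list of T feature vectors (each a list of D floats),
                    one per frame of a video segment.
    scoring_vector: list of D floats; a hypothetical learned parameter.
    Returns (weights, pooled): T attention weights and a D-dim segment feature.
    """
    # One relevance score per frame (dot product with the scoring vector).
    scores = [
        sum(f * w for f, w in zip(frame, scoring_vector))
        for frame in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Weighted sum over frames: high-weight (key) frames dominate the result.
    pooled = [
        sum(weights[t] * frame_features[t][d] for t in range(len(frame_features)))
        for d in range(dim)
    ]
    return weights, pooled


# Toy segment of three frames with 2-dim features; values are made up.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
weights, pooled = attention_pool(frames, [1.0, 1.0])
```

In this toy input the second frame has the highest score, so it receives the largest weight and contributes most to the pooled segment feature, while the near-redundant first and third frames are down-weighted.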