| With the rapid development of computer vision and deep learning algorithms,video spatio-temporal action detection based on deep learning has become a research hotspot.Spatio-temporal action detection algorithm aims to locate spatial and temporal position of the action in videos,classify the type of action,and finally generate an Action Tube.Based on the existing algorithms,this thesis proposes two action temporal detection algorithms to improve the speed and accuracy of existing algorithms,and also designs and implements a video spatio-temporal action detection system.The main work and innovations of this thesis are as follows:(1)This thesis proposes an action temporal detection algorithm based on long-term correlation,which aims to solve the complex and time-consuming problem of the existing algorithm.Firstly,the algorithm dectects human actions for each frame of the video to complete action spatial localization and classification.Nextly,the action temporal detection algorithm based on the long-term correlation connects the frame-level detection results and locates the time position of the action,and finally generates the Action Tube.The action temporal detection algorithm based on long-term correlation extends time candidate duration during linking frame-level results,thus the linking candidates will not be limited in the continuous fr-ames.The experimental results on the UCF101 dataset show that the algorithm can effectively improve the running speed of the algorithm and improve the accuracy of spatio-temporal action detection.Compared with the existing algorithm,the calculation speed is increased by 28.9%and the accuracy is increased by 1.35%on average.(2)This thesis proposes an action temporal detection algorithm based on multi-feature fusion,which aims to solve the problem of the spatial jump between continuous frames in the result of long-term correlation action temporal detection algorithm.Based on the action temporal detection algorithm with long-term correlation,this algorithm combines the scores of frame-level detections and intersection-over-union of consecutive detection boxes as linking standard,and link the frame-level detection results with this standard to generate the Action Tube.The experimental results on the UCF101 dataset show that the algorithm can effectively improve the accuracy of spatio-temporal action detection.Compared with the long-time correlation and Viterbi-based action temporal detection algorithm,the accuracy rate is increased by 1.48%and 2.83%respectively.(3)This thesis designs and implements a system for video spatio-temporal action detection.This system consists of three modules:optical flow extraction,spatial action detection and temporal action detection.In the optical flow extraction module,this system uses both traditional algorithms and algorithms based on deep learning to extract optical flows.In the spatial action detection module,this system uses an action spatial localiztion and recognition network based on deep learning to detect frame-level actions.In the temporal action detection module,the temporal action detection algorithms based on long-term correlation and multi-feature fusion are used respectively,and the system visualize the results.In addition,this thesis presents evaluation indicators for this system and evaluates the performance of the system. |