| Human action recognition is a technology that uses sensor data to identify the type of action a person is performing, and it is one of the most important research directions in computer vision. With the continuous improvement of internet bandwidth, industries such as online short video and webcasting have emerged, and the large volume of video data they generate urgently needs to be classified and managed properly. These tasks usually rely on action recognition. In addition, action recognition has broad applications in security monitoring, autonomous driving, and medical care, where it can alleviate staff shortages and the slow response of manual operation. As deep learning has excelled across artificial intelligence, more and more computer vision tasks have adopted deep learning techniques and achieved significant results. However, because of its high computational cost and limited recognition accuracy, deep-learning-based action recognition has not yet achieved satisfactory results. Improving the accuracy of action recognition while reducing its computational cost therefore has considerable research value and practical significance. This thesis studies these problems in action recognition and improves both accuracy and computation speed. The main contributions of this thesis are summarized as follows. First, after comparing various state-of-the-art action recognition methods, this thesis proposes an action recognition network architecture based on multi-scale boundary representation. It uses the boundary changes of moving objects in place of optical flow to represent motion. This effectively avoids the time-consuming computation and large storage footprint of optical flow images, so the method can be applied to real-time scenarios that require
low computational latency. The multi-scale features enable the network to adaptively learn the changes between video frames and improve its robustness to changes of different ranges, thereby improving network performance. The proposed method is evaluated on the Something-Something-V1 dataset, which places more emphasis on actions, and achieves a recognition rate of 54.80%, close to that of the state-of-the-art method (55.16%), while offering an advantage in computation speed. Second, to make effective use of action information and temporal information, this thesis proposes a temporal segment motion representation module and a temporal information fusion module, both of which strengthen the network's ability to extract and fuse temporal information. For an action recognition task, short-term action information and mid-to-long-term temporal information are both key to classification. Based on the idea of the Motion Squeeze module, the temporal segment motion representation module is computationally efficient and reuses the intermediate results of the boundary representation computation to improve network performance at very little computational cost. The temporal information fusion module yields a larger improvement in recognition accuracy and integrates information from different temporal segments through learned weights. This method achieves a recognition rate of 54.32% on the Something-Something-V1 dataset. |
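
As a minimal illustration of integrating temporal segments through learned weights, the following sketch assumes that each segment yields a vector of class scores and that the fusion step holds one learnable logit per segment, normalized with a softmax. The names `fuse_segments` and `weight_logits`, the toy scores, and the choice of softmax normalization are illustrative assumptions and may differ from the design of the proposed temporal information fusion module.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_segments(segment_scores, weight_logits):
    """Fuse per-segment class scores with softmax-normalized weights.

    segment_scores: list of T lists, each holding C class scores.
    weight_logits:  list of T floats (assumed to be learned during training).
    Returns a list of C fused class scores.
    """
    if len(segment_scores) != len(weight_logits):
        raise ValueError("one weight logit is required per temporal segment")
    weights = softmax(weight_logits)
    num_classes = len(segment_scores[0])
    fused = [0.0] * num_classes
    for weight, scores in zip(weights, segment_scores):
        for class_index, score in enumerate(scores):
            fused[class_index] += weight * score
    return fused


# Toy example: three temporal segments, two action classes (hypothetical values).
segment_scores = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]

# Equal logits reduce the fusion to a plain average over segments.
uniform_fused = fuse_segments(segment_scores, [0.0, 0.0, 0.0])

# A dominant logit makes the fused result follow that segment.
peaked_fused = fuse_segments(segment_scores, [100.0, 0.0, 0.0])
```

With equal logits the fusion is the same as averaging the segments, which is the usual consensus baseline; training the logits lets the network emphasize the segments that carry the most discriminative temporal information.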