| In recent years, with the rapid growth of big data and the spread of smart mobile devices, video has become an important medium for obtaining and sharing information. Video data carries rich information such as sound, images, and text, and is therefore widely used across research fields. Among these, action recognition has become a research hotspot because of its broad application prospects. Action recognition models based on the self-attention mechanism have attracted much attention for their strong performance, but existing models still have several problems. First, the feature receptive field has a single scale, so actions with a large span and long duration receive too little attention. Second, the feature space is anisotropic: features are unevenly distributed on the unit sphere, which degrades classification performance. Third, the classification layer has many trainable parameters, which makes training time-consuming. To address these shortcomings, this thesis studies the three problems systematically and proposes a new video action recognition model. The main contributions are as follows. (1) Multi-scale feature fusion: Most current models learn the feature representation of the input video at a fixed scale and fail to exploit the complementarity and correlation of features across scales. This thesis therefore proposes an action recognition model based on multi-scale global feature fusion. The model extracts local video features at different scales and establishes global correlations among the multi-scale local features. The proposed feature fusion module then correlates the multi-scale feature information to obtain a richer feature representation. Extensive comparative experiments show that the proposed model outperforms previous models.
(2) Multi-view contrastive learning: To address the anisotropic distribution of the feature space, this thesis proposes an action recognition model based on multi-view global feature contrastive learning. A contrastive learning framework is added on top of supervised learning to encourage a uniform distribution of the feature space and a compact distribution of similar features, yielding a more discriminative feature representation. This thesis also constructs a new loss function that controls the feature distribution of the model and thereby improves recognition accuracy. Detailed simulation results show that the features extracted by the contrastive learning framework have good alignment and consistency, which effectively improves the recognition accuracy of the model. (3) Fast classification with broad learning: Building on the previous work, this thesis replaces the classification layer of the original model with a broad learning system, because the existing classification layer has many trainable parameters and takes a long time to train. The broad learning system offers fast classification and universal approximation, and it can be constructed non-iteratively. The extracted feature representation is therefore passed to the broad learning system for classification, which speeds up training and strengthens the nonlinear fitting ability of the classification layer, improving recognition performance. In summary, this thesis focuses on several problems in action recognition models based on the self-attention mechanism and proposes a corresponding solution to each. The three proposed methods achieve higher recognition accuracy and faster classification than previous action recognition models, laying a solid foundation for future practical applications. |
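
To make the non-iterative construction in contribution (3) concrete, the following is a minimal sketch of a broad-learning-style classifier. It is not the thesis's implementation: it is a simplified variant that assumes fixed random enhancement nodes and a closed-form ridge-regression readout. The names `fit_broad_classifier`, `predict`, `n_enhance`, and `reg`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_broad_classifier(features, labels, n_classes, n_enhance=64, reg=1e-2, seed=0):
    """Fit a simplified broad-learning-style classifier in closed form.

    features: (n_samples, dim) array of already-extracted features.
    labels:   (n_samples,) array of integer class ids.
    """
    rng = np.random.default_rng(seed)
    # Enhancement nodes use random weights that are drawn once and never trained.
    w_enh = rng.standard_normal((features.shape[1], n_enhance))
    b_enh = rng.standard_normal(n_enhance)
    hidden = np.hstack([features, np.tanh(features @ w_enh + b_enh)])
    targets = np.eye(n_classes)[labels]  # one-hot targets
    # Ridge solution W = (H^T H + reg * I)^-1 H^T Y: a single linear solve,
    # with no gradient-descent iterations.
    gram = hidden.T @ hidden + reg * np.eye(hidden.shape[1])
    w_out = np.linalg.solve(gram, hidden.T @ targets)
    return w_enh, b_enh, w_out


def predict(features, params):
    """Return the predicted class id for each row of features."""
    w_enh, b_enh, w_out = params
    hidden = np.hstack([features, np.tanh(features @ w_enh + b_enh)])
    return np.argmax(hidden @ w_out, axis=1)


# Toy usage: two well-separated Gaussian clusters stand in for video features.
rng = np.random.default_rng(1)
x = np.vstack([rng.standard_normal((50, 8)) + 3.0,
               rng.standard_normal((50, 8)) - 3.0])
y = np.array([0] * 50 + [1] * 50)
params = fit_broad_classifier(x, y, n_classes=2)
train_acc = float(np.mean(predict(x, params) == y))
```

Because the output weights come from one linear solve, training cost in this sketch depends on the number of hidden nodes, not on a number of epochs, which is the property the abstract relies on for faster classification-layer training.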