Video action recognition is a branch of computer vision that aims to intelligently analyze video data and identify the actions of the people in it. The technology has a wide range of applications, including video retrieval, intelligent surveillance, human-computer interaction, and smart homes, and its sensible use can improve the quality of people's work and daily life. Video data contains both temporal and spatial information, and both are crucial for action recognition, so the spatio-temporal features of videos must be fully exploited. Current approaches include two-stream networks, which extract spatial and temporal features separately, and three-dimensional convolutional neural networks, which model spatio-temporal information jointly. However, these methods suffer from a weak coupling between temporal and spatial information, poor utilization of the temporal sequence, and high computational cost. How to extract spatial and temporal features effectively therefore remains an open research question. This paper studies, from several perspectives, how to efficiently extract spatio-temporal features from video data using two-dimensional convolutional neural networks. The main contributions are as follows:

(1) An attention-based multi-feature aggregation module for video action recognition is proposed. In video action recognition, spatial and temporal features play complementary roles in the network's discrimination, but traditional two-stream methods weaken the correlation between them. To address this, a multi-feature aggregation module based on attention is proposed, which produces aggregated spatio-temporal features with strong expressive power. First, separable convolution is used to model spatio-temporal information. Then, global context relations are captured by matrix multiplication, establishing connections between pixels. Next, the features containing the aggregated spatio-temporal information are weighted adaptively along the channel dimension, emphasizing channels that carry important information and suppressing minor ones. Finally, the module is embedded into a ResNet-50 network to build a low-complexity, high-accuracy action recognition network. The proposed method achieves competitive results on the UCF101 and HMDB51 datasets, with recognition accuracies of 95.3% and 69.2%, respectively.

(2) A spatio-temporal attention method for video action recognition based on temporal adaptation is proposed. In video data, the feature information differs across time, so not all features should be processed in the same way; doing so leads to an uneven allocation of resources and can even harm the final recognition accuracy. To address this, average pooling and max pooling are used to obtain global and local information along the temporal dimension, and fully connected layers then produce weights for the temporal dimension. According to these temporal weights, the feature maps are split along the temporal dimension into an important class and a minor class. Spatio-temporal features are then extracted directly by a spatio-temporal attention built on an energy function, and finally the two sets of spatio-temporal features are fused adaptively. This method significantly improves network performance, reaching 95.8% accuracy on UCF101 and 69.7% on HMDB51.

(3) A long-short temporal video action recognition method based on semantic edges is proposed. The behaviors in a video have distinct temporal characteristics, and different rates and orderings of the same behavior directly affect the final discrimination. To address this, semantic edge features are proposed to better express motion information. The semantic edge features are obtained by convolution over the feature maps, with the convolution kernel initialized as a Laplacian of Gaussian operator. Then, at the level of the semantic edge features, several dilated convolutions with different dilation rates capture temporal information at different time intervals, realizing both long-range and short-range temporal modeling. At the same time, a temporal attention is constructed to learn the importance of different time points and further mine temporal information. Finally, the above modules are embedded into the residual blocks of a ResNet-50 network to supplement temporal information. This method improves the utilization of temporal information, achieving 95.8% accuracy on UCF101 and 70.4% on HMDB51.
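The abstract does not give implementation details for the aggregation module in (1). As a rough, hedged sketch of the two core ideas it names, global context via matrix multiplication between pixels followed by channel reweighting, a minimal NumPy version might look as follows; the function name and the sigmoid channel gate are illustrative choices, not the thesis's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_context_channel_attention(x):
    """x: feature map of shape (C, H, W).
    Step 1: matrix multiplication over flattened pixels captures
            pairwise global context relations between pixels.
    Step 2: a per-channel statistic gated by a sigmoid reweights the
            channels, emphasizing informative ones."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                  # (C, N) pixels as columns
    relation = softmax(flat.T @ flat, axis=-1)  # (N, N) pixel-pixel relations
    context = (flat @ relation).reshape(C, H, W)  # each pixel mixes global info
    squeeze = context.mean(axis=(1, 2))         # (C,) per-channel statistic
    gate = 1.0 / (1.0 + np.exp(-squeeze))       # (C,) channel weights in (0, 1)
    return context * gate[:, None, None]

y = global_context_channel_attention(np.random.randn(8, 4, 4))
print(y.shape)  # (8, 4, 4)
```

In a real network the channel gate would be a small learned bottleneck (as in squeeze-and-excitation blocks) rather than a bare sigmoid of the mean.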
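For the temporally adaptive attention in (2), the abstract names the steps but not the exact energy function or fusion rule. The sketch below is a plausible stand-in under stated assumptions: `w_avg`/`w_max` substitute a scalar weighting for the learned fully connected layers, the split uses the median weight, and the energy term is a SimAM-style variance energy, none of which is confirmed by the source:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_split_attention(x, w_avg=1.0, w_max=0.5, lam=1e-4):
    """x: features of shape (T, C, H, W).
    1) average pooling (global) and max pooling (local) over (C, H, W)
       summarize each frame; a weighted sum stands in for the FC layers
       and yields one weight per time step.
    2) frames are split into important / minor groups by those weights.
    3) an energy-based spatial attention (variance energy, SimAM-style
       stand-in) is applied per frame, then the groups are fused with
       weight-dependent scaling."""
    avg = x.mean(axis=(1, 2, 3))                 # (T,) global statistic
    mx = x.max(axis=(1, 2, 3))                   # (T,) local (peak) statistic
    weights = sigmoid(w_avg * avg + w_max * mx)  # (T,) temporal weights
    important = weights >= np.median(weights)    # boolean split into two classes

    def energy_attention(f):
        mu, var = f.mean(), f.var()
        energy = (f - mu) ** 2 / (4 * (var + lam)) + 0.5
        return f * sigmoid(energy)

    out = np.empty_like(x)
    for t in range(x.shape[0]):
        out[t] = energy_attention(x[t])
        # adaptive fusion: minor frames are damped relative to important ones
        out[t] *= weights[t] if important[t] else 0.5 * weights[t]
    return out, weights, important

x = np.random.randn(6, 4, 8, 8)
out, w, imp = temporal_split_attention(x)
print(out.shape, w.shape)  # (6, 4, 8, 8) (6,)
```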
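Two concrete pieces of (3) can be illustrated directly: the Laplacian of Gaussian kernel used to initialize the semantic-edge convolution, and dilated convolution along the time axis, where a larger dilation rate spans a longer time interval. The kernel size, sigma, and the tiny 1-D temporal kernel below are illustrative assumptions:

```python
import numpy as np

def log_kernel(size=5, sigma=1.0):
    """Laplacian-of-Gaussian kernel for initializing the semantic-edge
    convolution (size and sigma are illustrative choices)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    k = (-1.0 / (np.pi * sigma ** 4)) * (1 - r2 / (2 * sigma ** 2)) \
        * np.exp(-r2 / (2 * sigma ** 2))
    return k - k.mean()  # zero-mean: flat (edge-free) regions give no response

def dilated_temporal_conv(x, kernel, dilation):
    """1-D convolution along the time axis with the given dilation rate;
    dilation=1 models short-range, larger dilations long-range temporal
    dependencies. x: (T, C) per-frame features; zero-padded 'same' output."""
    T, C = x.shape
    K = len(kernel)
    pad = dilation * (K // 2)
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            out[t] += kernel[k] * xp[t + k * dilation]
    return out

k = log_kernel()
feats = np.random.randn(10, 16)
short = dilated_temporal_conv(feats, np.array([1.0, -2.0, 1.0]), dilation=1)
long_ = dilated_temporal_conv(feats, np.array([1.0, -2.0, 1.0]), dilation=3)
print(k.shape, short.shape, long_.shape)  # (5, 5) (10, 16) (10, 16)
```

Running several such dilated branches in parallel and concatenating or summing them, with a temporal attention over the T axis, matches the long/short temporal modeling the abstract describes.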