With the development of deep learning, the performance of video action recognition has been greatly improved, and it has been widely applied in real life. However, existing methods usually incur higher model complexity and increased computational cost when modeling the temporal information of video actions. Moreover, as more long and complex video datasets are released, long-term actions have a longer temporal dimension and more complex motion information. This poses new challenges for existing methods in capturing long-term dependencies over the entire temporal scale, and existing methods ignore the connection between spatial and temporal information. In addition, video action recognition models with increasingly deep network structures conflict with ever smaller datasets, and the video data in existing datasets is usually idealized. Existing data augmentation methods, however, suffer from problems such as the loss of the semantic information of important targets while improving the generalization ability and robustness of the model. Therefore, this thesis proposes new methods for video action temporal feature learning, long-term action recognition, and data augmentation. The main work is as follows:

(1) To address the high model complexity and increased computational cost caused by temporal information modeling, a video action temporal feature learning method based on dynamic temporal shift is proposed. The relationships between features differ across channel dimensions, and shifting the channels whose features are closely related along the time dimension can capture effective temporal information. Therefore, a two-layer fully connected network is constructed to learn the relationships between the features of different time steps in each channel dimension and obtain the attention distribution over the channels. A dynamic temporal shift module is then designed to dynamically select the channels whose attention values exceed a threshold and shift them along the time dimension to obtain temporal features. Finally, the fixed parameters of the two-layer fully connected network are used to learn global spatiotemporal features, which are fused with the temporal features to further enhance the action feature representation. This method improves recognition accuracy on short, uniform datasets with low model complexity.

(2) To address the inability of existing methods to capture the long-term dependencies of long-term actions, and their severing of the relationship between the spatial and temporal information of long-term actions, a long-term action recognition method based on Two-MLPs is proposed. A network layer composed of Multi-Layer Perceptrons (MLPs) is designed to capture the long-term dependencies of long-term action features along the spatial and temporal dimensions; the MLP abandons inductive bias, allowing the network to learn features fully autonomously. A norm penalty term is added to the loss function to constrain the learning direction of the network and explore the connection between spatial and temporal information, and the proximal gradient algorithm is used to solve the resulting non-convex optimization problem. The method achieves good long-term action recognition accuracy on long and complex datasets, and experiments verify the effectiveness of the connection between spatial and temporal information.

(3) To address the loss of CNN translation invariance and of the semantic information of important targets when existing data augmentation methods alleviate overfitting, a video data augmentation method based on improved bounding box regression is proposed. First, a new measure, SIoU, is proposed, and the Beta distribution is introduced to improve the existing bounding box regression method. Then, a data augmentation network is built to capture the target area in the original image and preliminarily augment the original image; the target area is copied onto the preliminarily augmented image to generate a new sample, which avoids the loss of important target semantic information. Experiments show that this method improves the performance of existing video action recognition methods.
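The dynamic temporal shift idea in (1) can be illustrated with a minimal NumPy sketch. All names, shapes, the sigmoid gating, and the 0.5 threshold are illustrative assumptions, not the thesis's actual implementation: a two-layer fully connected network scores each channel, and only the channels whose attention exceeds the threshold are shifted along the time dimension (half backward, half forward), as in standard temporal shift modules.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Score each channel with a two-layer FC network over the time dimension.

    x: (T, C) clip features; w1: (T, H); w2: (H, 1).
    Returns a sigmoid attention value per channel, shape (C,).
    """
    h = np.maximum(0.0, x.T @ w1)          # (C, H), ReLU hidden layer
    scores = (h @ w2).ravel()              # (C,) one score per channel
    return 1.0 / (1.0 + np.exp(-scores))   # sigmoid attention

def dynamic_temporal_shift(x, attn, threshold=0.5):
    """Shift only the channels whose attention exceeds the threshold.

    Half of the selected channels are shifted one step toward earlier
    frames, the other half toward later frames; vacated slots are zeroed.
    """
    out = x.copy()
    sel = np.flatnonzero(attn > threshold)
    back, fwd = sel[: len(sel) // 2], sel[len(sel) // 2:]
    out[:-1, back] = x[1:, back]   # pull future features earlier
    out[-1, back] = 0.0
    out[1:, fwd] = x[:-1, fwd]     # push past features later
    out[0, fwd] = 0.0
    return out
```

Channels with low attention pass through unchanged, so the module adds temporal mixing only where the learned channel relationships suggest it is useful, keeping the extra computation small.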
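The proximal gradient algorithm mentioned in (2) can be sketched for the common case of an L1 norm penalty; the thesis does not specify which norm it uses, so the choice of L1 here is an assumption for illustration. With an L1 penalty, the proximal step reduces to soft-thresholding of the weights after an ordinary gradient step on the smooth part of the loss.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: shrink each weight toward zero by t,
    # zeroing any weight whose magnitude is below t.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def proximal_gradient_step(w, grad, lr, lam):
    # One iteration: gradient descent on the smooth loss term,
    # followed by the proximal operator of the (non-smooth) penalty.
    return soft_threshold(w - lr * grad, lr * lam)
```

Because the penalty is non-smooth, plain gradient descent does not apply directly; splitting the update into a gradient step plus a closed-form proximal step is what lets the method handle the non-convex regularized objective.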