| In recent years, with advances in computer hardware and software, the volume of video data on the Internet has grown exponentially, and video-based human behavior recognition has become a key component of effective video management and analysis. This paper applies deep learning to the problem of accurately recognizing human behavior in videos along two dimensions: temporal sequences and spatial features. The main research contents are as follows. (1) Modeling the one-dimensional feature vectors output by the fully connected layer directly over time yields poor recognition. To address this, the paper uses a convolutional long short-term memory network (ConvLSTM) to model the feature maps output by the convolutional layer over time while taking spatial information into account. To capture actions more accurately, a long short-term memory network (LSTM) is used to further process the ConvLSTM output features into a video description. Attention mechanisms are also incorporated into the feature extraction network to extract features that are useful for behavior recognition, and the optimal point of incorporation is explored. (2) Representing an entire video by the LSTM output at the last time step alone leads to low recognition accuracy. To address this, the paper designs an aggregation network that adaptively aggregates the LSTM outputs at all time steps: first, the input features are scanned to obtain weight coefficients; second, each input feature is merged into the aggregated feature vector according to its weight coefficient, and this continues until the scan is complete, yielding the final video description. The improved human behavior recognition model achieves an accuracy of 91.26% on the UCF101 dataset, 5 percentage points higher than the approach that represents the whole video by the last-time-step output of the LSTM. (3) To make full use of both spatial and temporal information, the paper uses weighted fusion to combine the recognition results of feature-map modeling (ConvLSTM and LSTM applied to the convolutional-layer output) with the recognition results of one-dimensional global feature modeling (LSTM and the adaptive aggregation network applied to the fully connected layer output), and the fused features are fed into the classifier to obtain the final recognition result. Fusing spatial and temporal information further improves recognition, and the accuracy reaches 95.68%. |
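
The aggregation and fusion steps described in items (2) and (3) can be illustrated with a minimal sketch. The abstract does not specify how the weight coefficients are computed or how the two streams are weighted, so several things here are assumptions rather than the paper's method: the softmax scoring against a hypothetical learned `query` vector, the fusion weight `alpha`, and the function names `adaptive_aggregate` and `weighted_fusion`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_aggregate(hidden_states, query):
    """Pool per-time-step LSTM outputs into one video-level vector.

    hidden_states: list of T feature vectors, one per time step.
    query: hypothetical learned scoring vector (an assumption, not
           taken from the paper).
    """
    # Step 1: scan the inputs and score each time step against the query,
    # then normalize the scores into weight coefficients.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)

    # Step 2: merge each input into the aggregated vector according to
    # its weight coefficient until every time step has been visited.
    aggregated = [0.0] * len(query)
    for w, h in zip(weights, hidden_states):
        aggregated = [a + w * h_i for a, h_i in zip(aggregated, h)]
    return aggregated, weights


def weighted_fusion(probs_a, probs_b, alpha=0.5):
    """Blend two class-score vectors; alpha is an assumed fusion weight."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(probs_a, probs_b)]
```

Unlike taking only the last LSTM output, this pooling lets every time step contribute in proportion to its weight. In a trained model `query` and `alpha` would be learned or tuned on validation data; here they are plain inputs so that the arithmetic is easy to follow.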