| Behavior recognition, also known as action recognition, is an important research task in computer vision. Its main goal is to determine the category of human action in a video clip. With the development of deep learning, deep-learning-based behavior recognition has become a research hotspot, and because video data is complex, these methods differ widely in how they model it. Among them, methods based on 3D convolutional networks can model the temporal and spatial information of a video simultaneously and achieve good recognition results. This thesis studies behavior recognition based on 3D convolutional networks. The main work is as follows: (1) A two-stage behavior recognition method based on 3D-ResNet and behavior semantics is proposed. The method first introduces the semantic information of the behavior category labels and groups similar behavior categories into the same category according to the semantic similarity between their labels. The recognition process is then divided into two stages. The first stage uses a similarity cross-entropy loss function to guide training so that the network learns a coarse classification; the second stage uses the cross-entropy loss function to train the network to classify videos at a fine-grained level. Comparative experiments on the UCF101 and HMDB51 behavior recognition data sets show that this method outperforms the 3D-ResNet34 method, increasing the final recognition accuracy by 2.88% and 0.89% respectively. (2) In a 3D convolutional network, when the input sequence of video frames is short, the behavioral semantic information of some time periods is lost, which affects the final recognition result. Therefore, a behavior recognition method based on 3D convolution and multi-level semantic information fusion is proposed. This method uses a multi-level semantic information
fusion module to gather the temporal semantic information from each intermediate-layer feature of the 3D convolutional network, in order to prevent the loss, during 3D convolution, of temporal semantic information that is critical to recognizing the behavior category. This semantic information is then fused with the features extracted by the 3D convolutional network to improve recognition accuracy. Ablation and comparative experiments on the UCF101 and HMDB51 data sets demonstrate the effectiveness of the method. After 3D-ResNet34 is combined with the multi-level semantic information fusion module, accuracy on the two data sets improves by 2.74% and 0.78% respectively. (3) To address the problem that the input video frames selected by existing methods contain repetitive information, a behavior recognition method based on 3D convolution and key frame selection is proposed. This method uses dictionary learning to obtain a sparse representation of the video frames. A key frame selection algorithm is designed so that the input frame sequence contains more information without changing its length and therefore better represents the entire action; the selected key frame sequence is then used as the input to the 3D convolutional network for behavior recognition. Comparative experiments on the UCF101 and HMDB51 data sets demonstrate the effectiveness of the method. After 3D-ResNet34 uses the key frame sequence as input, the recognition accuracy on the two data sets increases by 1.88% and 0.35% respectively. |
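
The two-stage idea in (1) can be illustrated with a small sketch. The abstract does not define the similarity cross-entropy loss, so the coarse loss below is only one plausible reading and is an assumption: it sums the predicted probabilities of all classes in the same semantic group and applies cross-entropy to the group of the true label. The names `coarse_cross_entropy`, `fine_cross_entropy`, and `class_to_group`, as well as the toy group assignment, are hypothetical and not taken from the thesis.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def fine_cross_entropy(logits, labels):
    # Stage 2: standard cross-entropy over the fine-grained classes.
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def coarse_cross_entropy(logits, labels, class_to_group):
    # Stage 1 (assumed form, not the thesis's definition): pool the
    # probabilities of classes that share a semantic group, then apply
    # cross-entropy to the group that contains the true label.
    probs = np.exp(log_softmax(logits))
    n_groups = class_to_group.max() + 1
    membership = np.eye(n_groups)[class_to_group]   # (classes, groups)
    group_probs = probs @ membership                # (batch, groups)
    target_groups = class_to_group[labels]
    picked = group_probs[np.arange(len(labels)), target_groups]
    return -np.log(picked).mean()

# Hypothetical toy setup: 4 classes, where classes 0 and 1 are
# semantically similar and therefore share group 0.
class_to_group = np.array([0, 0, 1, 2])
logits = np.zeros((1, 4))   # uniform prediction over the 4 classes
labels = np.array([1])

coarse_loss = coarse_cross_entropy(logits, labels, class_to_group)
fine_loss = fine_cross_entropy(logits, labels)
```

Under this reading, confusing two classes within the same group is not penalized in the first stage, so the coarse loss is never larger than the fine loss for the same prediction; the second stage then separates the classes inside each group.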