
Research On Action Recognition Method Based On Spatiotemporal Feature Fusion And Knowledge Distillation Technology

Posted on: 2023-07-30; Degree: Master; Type: Thesis
Country: China; Candidate: W Liang; Full Text: PDF
GTID: 2568306833989189; Subject: Engineering
Abstract/Summary:
Action recognition is an important research topic in computer vision with a wealth of application scenarios, including behavior analysis, video retrieval, human-computer interaction, and game entertainment. Existing video-based action recognition algorithms usually have two issues: 1) Because a video is a sequence of frames, its temporal and spatial dimensions are not equally important. 2) Although many schemes for extracting temporal features from video work at a fine granularity, they still fail to distinguish the visual rhythm of an action; moreover, if spatial and temporal features are fed into the classifier in equal proportions, the spatiotemporal features become imbalanced and the classification results suffer. To address these two issues, we present a spatiotemporal feature pyramid network based on 3D convolution. The spatiotemporal feature pyramid widens the receptive field in both the spatial and temporal dimensions, solving the problem that the model cannot differentiate the visual rhythm of actions. The multilayer feature extraction module ensures that the spatiotemporal features fed into the classifier are reasonably balanced, addressing the issue of spatiotemporal feature imbalance in video. On the public Kinetics-400 dataset, the 3D convolution-based spatiotemporal feature pyramid network reaches a best accuracy of 76.68% top-1 and 93.18% top-5, a substantial advantage over other techniques.
To be used in real-world applications, most algorithmic models must be deployed on resource-constrained devices. However, the large size of these models makes deployment difficult, which is a prevalent issue in deep learning. Model compression is therefore essential for model optimization. We apply knowledge distillation, a model compression approach, to video-based action recognition. We design a layered feature distillation module to address the particular characteristics of the action recognition task. This module splits the features along the temporal dimension and the spatial dimension and compares each separately, so that the output features of each layer of the student model are as close as possible to the output features of the corresponding layer of the teacher model. Its core is a spatiotemporal feature transfer loss function, which fully accounts for the transfer of both temporal and spatial video information during knowledge distillation. In the experiments, we use 3D ResNets of different depths as the feature extraction network on the public UCF101 dataset. The results show that training with the multilayer feature distillation module improves not only the training efficiency of the model but also its recognition accuracy, by up to 4.4%.
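The abstract does not give the exact form of the spatiotemporal feature transfer loss, so the following is only a minimal sketch of the general idea, written under several assumptions: each feature map has shape (C, T, H, W); the temporal part is obtained by averaging over the spatial axes and the spatial part by averaging over time; the two are compared with mean squared error; and the student and teacher features at each layer already have matching shapes. The function names and the weights `alpha` and `beta` are hypothetical and are not taken from the thesis.

```python
import numpy as np


def spatiotemporal_transfer_loss(student, teacher, alpha=1.0, beta=1.0):
    """Illustrative loss for one layer; NOT the thesis's actual formula.

    student, teacher: arrays of shape (C, T, H, W) with identical shapes
    (an assumption; a real student may need a projection layer first).
    alpha, beta: hypothetical weights for the temporal and spatial terms.
    """
    if student.shape != teacher.shape:
        raise ValueError("student and teacher features must share a shape")

    # Temporal descriptor: average out the spatial axes -> shape (C, T).
    s_temporal = student.mean(axis=(2, 3))
    t_temporal = teacher.mean(axis=(2, 3))

    # Spatial descriptor: average out the time axis -> shape (C, H, W).
    s_spatial = student.mean(axis=1)
    t_spatial = teacher.mean(axis=1)

    # Compare each part separately with mean squared error.
    temporal_term = np.mean((s_temporal - t_temporal) ** 2)
    spatial_term = np.mean((s_spatial - t_spatial) ** 2)
    return alpha * temporal_term + beta * spatial_term


def layered_distillation_loss(student_feats, teacher_feats, alpha=1.0, beta=1.0):
    """Sum the per-layer loss over corresponding student/teacher layers."""
    if len(student_feats) != len(teacher_feats):
        raise ValueError("need one teacher feature map per student layer")
    return sum(
        spatiotemporal_transfer_loss(s, t, alpha, beta)
        for s, t in zip(student_feats, teacher_feats)
    )
```

In a training setup, a term like this would typically be added to the ordinary classification loss of the student; how the thesis weights and combines the terms is not stated in the abstract.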
Keywords/Search Tags: action recognition, spatiotemporal features, knowledge distillation, spatiotemporal feature transfer loss