
Research On Video Behavior Recognition Based On Temporal Action Localization

Posted on: 2024-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: C H Ma
Full Text: PDF
GTID: 2568307103472304
Subject: New generation electronic information technology (including quantum technology, etc.)

Abstract/Summary:
With the development of Internet technology, video has become the dominant form of dynamic multimedia, and understanding and analyzing these videos is essential. Currently, tasks such as extracting exciting, high-energy clips from athletic event videos or locating criminal activity in surveillance footage must be performed manually, wasting human and material resources. Quickly and accurately localizing actions in time is therefore a key challenge in video understanding, with important applications in areas such as video mining, intelligent editing, and smart recognition. Based on this analysis, this thesis studies temporal action localization and proposes two temporal action localization algorithms, with the following main contributions:

(1) To address the problems that some action boundaries are blurred, that start and end boundaries can be very similar, and that existing temporal action detection models capture short-term and long-term dependencies in only a single way, this thesis proposes a multi-scale context modeling network. First, a multi-resolution context aggregation (MRCA) module is designed: the pre-extracted video features are divided into two feature streams with different temporal scales, called the high-resolution stream and the low-resolution stream. These are fed independently into the MRCA module, which models temporal relationships at different time scales in a structured way and aggregates global temporal context through a multi-head self-attention mechanism. Second, an information enhancement (IE) module is designed to strengthen the aggregation of long- and short-range contextual information and to increase the diversity and robustness of the contextual representation. Finally, the output of the information enhancement module is aggregated with the high- and low-resolution streams, and the final result is produced after post-processing. Extensive experiments show that this model achieves good recognition accuracy on all three public datasets.

(2) To address the problems that most current models exploit only temporal context, and that existing methods which fuse temporal and semantic context into video features use a single contextual expression and carry insufficient information, this thesis proposes a pyramid-structured model with multiple temporal resolutions. First, a temporal-semantic context aggregation (TSCF) module is designed, which assigns different attention weights to the temporal context and dynamically aggregates semantically similar segments based on dynamic edge convolution; the two kinds of context are then jointly aggregated into the video features. Second, to handle the large differences in temporal span between different actions in a video, a local-global attention module (LGAM) is designed that combines local and global temporal dependencies at each temporal point, yielding a more flexible and robust representation of contextual relationships. Computation is reduced by modifying the convolution to remove redundant representations in the convolution kernel and by redeploying compute at a fine granularity. Experiments on three large-scale public datasets show that the model's accuracy improves and that it performs excellently compared with other mainstream methods.
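To make the first contribution concrete, the following is a minimal numpy sketch of the core idea behind the MRCA module as described above: a high-resolution and a (temporally downsampled) low-resolution feature stream, each aggregating global temporal context via multi-head self-attention. All function names, dimensions, and the random projection weights are hypothetical illustrations, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Aggregate global temporal context over a (T, D) snippet-feature sequence."""
    T, D = x.shape
    assert D % num_heads == 0
    d = D // num_heads
    # random projections stand in for learned Q/K/V parameters
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.empty_like(x)
    for h in range(num_heads):
        s = slice(h * d, (h + 1) * d)
        # (T, T) attention lets every time step attend to the whole sequence
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(d))
        out[:, s] = attn @ v[:, s]
    return out

def downsample(x, stride=2):
    """Average-pool along time to form the low-resolution stream."""
    T = (x.shape[0] // stride) * stride
    return x[:T].reshape(-1, stride, x.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))               # 16 snippets, 8-dim features
high = multi_head_self_attention(feats, 2, rng)    # high-resolution stream
low = multi_head_self_attention(downsample(feats), 2, rng)  # low-resolution stream
print(high.shape, low.shape)  # (16, 8) (8, 8)
```

In the actual model the two streams are processed by learned layers and later re-aggregated; here the point is only that each stream independently models temporal relationships at its own time scale.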
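The local-global idea in the LGAM module of the second contribution can likewise be sketched: each temporal point attends once over a local window (short actions) and once over the full sequence (long actions), and the two branches are fused. The window size, the averaging fusion, and all names below are illustrative assumptions; the thesis's module would use learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_global_attention(x, window=3):
    """Combine local windowed attention and global attention per time step."""
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)          # pairwise similarity between steps
    # global branch: attend over the whole sequence
    global_out = softmax(scores) @ x
    # local branch: mask out scores beyond +/- `window` steps
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    local_out = softmax(np.where(mask, scores, -np.inf)) @ x
    # simple fusion by averaging (a learned gate in a real model)
    return 0.5 * (local_out + global_out)

x = np.random.default_rng(1).standard_normal((10, 4))
y = local_global_attention(x)
print(y.shape)  # (10, 4)
```

Because the diagonal is always inside the local window, the masked softmax is well defined at every time step; masked entries contribute exactly zero weight.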
Keywords/Search Tags:Action Recognition, Temporal Action Localization, Contextual Aggregation, Attention Mechanisms, Semantic Context