
Deep Learning Based Video Temporal Event Detection

Posted on: 2023-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: J M Zhang
Full Text: PDF
GTID: 2558306914971169
Subject: Degree of mechanical engineering
Abstract/Summary:
Video temporal event detection aims to locate the temporal start and end of events in untrimmed videos and assign each event a category. It is a foundation of visually intelligent human-robot interaction, can assist other video understanding tasks, and is used in video analysis, review, recommendation, and other scenarios. With the explosion of video data and the development of deep learning, video temporal event detection has made significant breakthroughs. However, many problems remain due to the complexity of untrimmed videos and the difficulty of temporal event detection. In this thesis, we explore several of these problems:

(1) Most current temporal event detection methods use class-semantics representations extracted by a video classification model. Because untrimmed videos contain a large proportion of background clips, there is a gap between the class-semantics representations of untrimmed videos and those of trimmed videos. Location-semantics representations express the background better, because their encoder is trained on untrimmed video, but they are resource-consuming to obtain. From the perspective of ease of use, it is therefore worthwhile to draw on location-semantics representations to improve the expressiveness of class-semantics representations, achieving both ease of use and good performance. By comparing the two kinds of representations, we observe the importance of modeling the distinction between foreground and background, and propose injecting foreground and background information into the class-semantics representation online. First, we estimate the action probability of dense frames online and generate a new representation based on the probability sequence, which improves the model's ability to distinguish foreground from background and thus the detection results. In addition, we use multi-stage optimization to alleviate the noise introduced by the action probability sequence.

(2) In the query-based paradigm, cross-attention is vital for fusing the temporal representation and reducing the number of proposals, and the quality of this fusion directly affects proposal quality. To improve it, we propose a new cross-attention method that combines dynamic attention and static attention in multi-head space to enhance the quality of the generated query embeddings: the former is strongly tied to the temporal representation and the query sequence, while the latter is strongly tied to the data space. In addition, based on the characteristics of the decoder, we propose a DP switching fine-tuning strategy that uses dropout to improve multi-stage optimization.

(3) We explore the whole detection pipeline from an application perspective, implementing and optimizing a three-stage detection method and proposing a recognition-based detection method. A) We put forward a three-stage temporal event detection method with a lightweight feature, a multi-granularity proposal method, and a new proposal classifier; the detection results are improved, and an inference efficiency of 8-9 seconds per video is achieved. B) In detection scenarios with frequent and rapid action transitions, mainstream temporal event detection methods take offline representations as input and lose significant temporal resolution in subsequent processing. We propose online dense-frame sensing based on the recognition model, which improves detection over the mainstream methods.
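The foreground/background enhancement described in contribution (1) can be sketched as follows. This is an illustrative reconstruction, not the thesis code: the function name `enhance_with_action_prob` and the concatenation scheme are assumptions, standing in for whatever transform the thesis applies to the probability sequence.

```python
def enhance_with_action_prob(features, action_prob):
    """Blend per-frame class-semantics features with a copy weighted
    by the estimated action (foreground) probability.

    features    -- list of per-frame feature vectors (lists of floats)
    action_prob -- per-frame foreground probability in [0, 1]

    Each output frame is the original feature concatenated with its
    probability-scaled copy, making the foreground/background
    distinction explicit to downstream detection layers.
    """
    assert len(features) == len(action_prob)
    enhanced = []
    for feat, p in zip(features, action_prob):
        weighted = [p * x for x in feat]   # suppress likely background
        enhanced.append(feat + weighted)   # concat: [original | weighted]
    return enhanced

# Toy example: frame 0 is confidently foreground, frame 1 mostly background.
frames = [[1.0, 2.0], [3.0, 4.0]]
probs = [1.0, 0.25]
out = enhance_with_action_prob(frames, probs)
```

In this sketch the background frame's weighted half is attenuated toward zero, which is one simple way the probability sequence could sharpen the foreground/background boundary; the thesis additionally applies multi-stage optimization to cope with noise in the estimated probabilities.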
Keywords/Search Tags:temporal event detection, temporal event proposal, video understanding