
Research on Student Behavior Detection and Description Method Based on Video Understanding

Posted on: 2022-10-08    Degree: Master    Type: Thesis
Country: China    Candidate: X Q Chen    Full Text: PDF
GTID: 2517306344452084    Subject: Computer Software and Computer Applications
Abstract/Summary:
With the development and popularization of smart education, video data related to students is growing explosively. Temporal action detection and description generation, two important research tasks in video understanding, span two technical fields: computer vision and natural language processing. This thesis takes student behavior videos as its research object and studies methods for temporal action detection and dense video captioning, providing technical support for subsequent student behavior analysis. The main innovations are as follows:

(1) A single-stage temporal action detection method, the Timeception Single Shot Action Detector (TC-SSAD), is proposed. By combining cascaded Timeception layers with a super-event module, the method addresses two weaknesses of existing single-stage detectors: their inability to adapt to the wide variation in action durations, and their failure to exploit the temporal structure and context of the whole input video. Experiments show that TC-SSAD achieves mAPs of 22.3% and 44.3% on the student temporal action detection dataset (STAD) and on THUMOS14, respectively, 2.7% and 2.4% higher than the original network. On STAD, five action categories improve significantly, including "listening to music through headphones" (13.6%) and "playing with a mobile phone" (22.1%). On ActivityNet-1.3, the average mAP is 20.4%, 0.61% better than the original network.

(2) A weakly supervised temporal action detection method based on two-stream completeness modeling (TSCM) is proposed, which avoids the imprecise manual annotation of action start and end times. On the one hand, a multi-branch action completeness modeling module generates complete action instances from RGB features and optical flow features separately; on the other hand, the ACL-PT (Angular Center Loss with a Pair of Triplets) loss function is introduced to suppress interference from background frames and to learn more discriminative foreground and background features. Experiments show that TSCM reaches 27.45% mAP at IoU = 0.5 on STAD; "reading" obtains the best per-class AP (75.8%), and the APs of "taking notes" and "sleeping on the desk" both exceed 57%. On THUMOS14 and ActivityNet-1.2, the average mAPs are 34.8% and 22.1%, respectively, outperforming mainstream methods such as W-TALC and AutoLoc.

(3) The student dense video captioning dataset (SDVC) is constructed, and a dense video captioning model based on multi-modal features is proposed. Building on the BMN network, the model introduces audio information to generate action proposals; building on the Bi-modal Transformer, it uses a temporal semantic relation module to model the rich temporal structure and semantic relations among the multiple events in a video. Experiments show an average METEOR of 17.48% on SDVC with generated action proposals. On ActivityNet Captions, the average METEOR is 11.32% with ground-truth action proposals, 0.42% better than the original network; with generated proposals it is 8.03%, surpassing mainstream dense captioning methods such as the Bi-modal Transformer.
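To make the Timeception idea in (1) concrete, the following is a minimal PyTorch sketch of a grouped multi-scale temporal convolution block. The kernel sizes, per-branch channel split, and the omission of Timeception's channel shuffling are simplifying assumptions for illustration, not the thesis's exact configuration.

```python
# A minimal sketch of a Timeception-style multi-scale temporal layer.
# Input: snippet-level features of shape (batch, channels, time).
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Each branch applies a depthwise temporal convolution with a different
    kernel size, so short and long actions are covered without committing
    to a single temporal receptive field."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        branch_ch = channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Sequential(
                # 1x1 conv to shrink channels for this branch
                nn.Conv1d(channels, branch_ch, kernel_size=1),
                # depthwise temporal conv with a branch-specific kernel size
                nn.Conv1d(branch_ch, branch_ch, kernel_size=k,
                          padding=k // 2, groups=branch_ch),
                nn.BatchNorm1d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # concatenate branch outputs along the channel dimension
        return torch.cat([b(x) for b in self.branches], dim=1)

# Usage: 100 snippet features with 512 channels, batch of 2
feats = torch.randn(2, 512, 100)
layer = MultiScaleTemporalConv(512)
print(layer(feats).shape)  # torch.Size([2, 512, 100])
```

Cascading several such layers, as TC-SSAD does, grows the temporal receptive field multiplicatively while keeping the parameter count low, which is what lets one detector cope with both brief and sustained student behaviors.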
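The ACL-PT loss in (2) builds angular triplets on the unit hypersphere. The sketch below shows one such triplet term, assuming video-level foreground/background embeddings and learnable class centers; the margin value and the reduction to a single triplet (the full loss uses a pair of triplets) are assumptions made for illustration.

```python
# A hedged sketch of one angular triplet term in the spirit of ACL-PT:
# each class center should lie closer (in cosine distance) to the
# foreground embedding of a video of that class than to its background
# embedding, by at least a margin.
import torch
import torch.nn.functional as F

def angular_triplet_loss(centers, fg_feat, bg_feat, labels, margin=0.5):
    """centers: (num_classes, dim) learnable class centers
    fg_feat/bg_feat: (batch, dim) aggregated foreground/background features
    labels: (batch,) video-level class indices (the only supervision)."""
    c = F.normalize(centers[labels], dim=1)  # center of each video's class
    f = F.normalize(fg_feat, dim=1)
    b = F.normalize(bg_feat, dim=1)
    # cosine distances on the unit hypersphere
    d_fg = 1.0 - (c * f).sum(dim=1)
    d_bg = 1.0 - (c * b).sum(dim=1)
    # foreground must beat background by `margin`; hinge otherwise
    return F.relu(d_fg - d_bg + margin).mean()
```

In the full method a term of this kind would be combined with the video-level classification loss; pushing background embeddings away from every class center is what suppresses background-frame interference under weak supervision.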
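The mAP figures quoted throughout are computed at temporal IoU thresholds (e.g. IoU = 0.5). Below is a minimal sketch of the two building blocks, temporal IoU and per-class average precision with greedy matching; the exact evaluation protocols of STAD, THUMOS14, and ActivityNet differ in details such as precision-recall interpolation, which are simplified here.

```python
# Temporal IoU between two (start, end) segments, plus a simple
# non-interpolated per-class AP with greedy prediction-to-GT matching.
def temporal_iou(seg_a, seg_b):
    """IoU of two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr=0.5):
    """preds: list of (start, end, score); gts: list of (start, end),
    both for a single action class in a single video or dataset split."""
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = [False] * len(gts)
    hits, precisions = 0, []
    for rank, (s, e, _) in enumerate(preds, start=1):
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts):
            iou = temporal_iou((s, e), g)
            if not matched[j] and iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:       # true positive: count precision here
            matched[best_j] = True
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(gts) if gts else 0.0
```

mAP is then the mean of this AP over all action classes; "average mAP" on ActivityNet additionally averages over a range of IoU thresholds.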
Keywords/Search Tags: Temporal action detection, Weakly supervised learning, Dense video captioning, Video analysis