
Human-object Interaction And Video Understanding Under Complex Scenarios

Posted on: 2023-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: H W Fan
GTID: 2568306914477174
Subject: Information and Communication Engineering
Abstract/Summary:
Human-object interaction (HOI) understanding, which aims to recognize the semantics of human interactions and the objects involved in them in visual content, is an important part of visual behavior analysis. Recent HOI research focuses mainly on interaction understanding in two-dimensional space with a limited set of interactive object categories, and has produced many valuable datasets such as HICO-DET and Action Genome. However, this setting makes the interactive objects a poor match for objects in the real world. On the one hand, the categories of interactive objects are restricted to a fixed range, whereas real-world objects follow a long-tailed distribution. On the other hand, two-dimensional object information limits a visual system's ability to predict what happens after an interaction event. Existing HOI datasets and benchmarks struggle to meet either of these two requirements.

To address the problem of insufficient object categories, this thesis makes three contributions. First, it collects Discovering Interactive Objects (DIO), a dataset of real-world long-tailed categories that contains 51 interaction types and 1061 interactive object classes, with each interactive object annotated with both two-dimensional and three-dimensional positions. Based on DIO, it proposes a new 2D-3D multi-modal HOI understanding task, whose goal is to locate the 2D bounding box of each interactive object together with its 3D position and size, and it designs corresponding evaluation criteria for the task. Second, it proposes a Multi-modal Object Discovery Network (MODN) to solve this task. Given an input video frame, MODN predicts all interacting people, their interaction categories, and the corresponding 2D-3D interactive object positions in the video clip, and it integrates the 2D and 3D modalities with human-object information to improve overall performance. Third, it designs baseline experiments that explore the performance of MODN under different network configurations, providing a basic reference for the DIO task, and it discusses how the modules of MODN interact.

The experimental results show that DIO is more difficult than the base dataset AVA, which poses challenges for subsequent applications, and that MODN offers useful insights for addressing them.
Keywords/Search Tags:human-object interaction, video understanding, 3-d reconstruction, action recognition, multi-modal learning