Human-object Interaction And Video Understanding Under Complex Scenarios

Posted on:2023-11-06

Degree:Master

Type:Thesis

Country:China

Candidate:H W Fan

Full Text:PDF

GTID:2568306914477174

Subject:Information and Communication Engineering

Abstract/Summary:

Human-object interaction (HOI) understanding, which aims to recognize the semantics of human interactions and the objects involved in them in visual content, is an important part of visual behavior analysis. Recent HOI research focuses mainly on interaction understanding in two-dimensional space with a limited set of interactive object categories, and has produced many valuable datasets such as HICO-DET and Action Genome. However, this setting makes the interactive objects a poor match for real-world objects. On the one hand, the object categories are restricted to a fixed range, whereas objects in the real world follow a long-tailed distribution; on the other hand, two-dimensional object information limits a visual system's ability to predict what happens after an interaction event. Existing HOI datasets and benchmarks struggle to meet these two requirements. To address the problem of insufficient object categories, this thesis makes the following three contributions. First, it collects Discovering Interactive Objects (DIO), a dataset of real-world long-tailed categories that contains 51 interaction types and 1061 interactive object classes, in which the interactive objects have both two-dimensional and three-dimensional positions. It also proposes a new 2D-3D multi-modal HOI understanding task, whose goal is to locate the 2D bounding box of each interactive object together with its 3D position and size, and designs corresponding evaluation criteria for this task. Second, it proposes a Multi-modal Object Discovery Network (MODN) for this task. Given an input video frame, MODN first predicts all interacting people, the interaction categories, and the corresponding 2D-3D interactive object positions in the video clip, and then integrates the 2D-3D modalities and human-object information to improve overall performance. Third, it designs baseline experiments that explore the performance of MODN under different network configurations, providing a basic reference for the DIO task, and discusses the interplay among MODN's modules. The experimental results show that DIO is more difficult than the base dataset AVA, which poses challenges for subsequent applications; MODN offers some inspiration for addressing them.
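
The abstract does not spell out the evaluation criteria for the 2D-3D localization task. As a rough illustration only, the sketch below assumes an intersection-over-union (IoU) overlap measure, a common choice for localization tasks, applied to a 2D bounding box and to an axis-aligned 3D box described by position and size. The function names and box layouts are hypothetical and are not taken from the thesis.

```python
from typing import Sequence, Tuple


def center_size_to_corners(center: Sequence[float],
                           size: Sequence[float]) -> Tuple[float, ...]:
    """Convert a (center, size) box into (min_1..min_d, max_1..max_d)."""
    mins = [c - s / 2.0 for c, s in zip(center, size)]
    maxs = [c + s / 2.0 for c, s in zip(center, size)]
    return tuple(mins + maxs)


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes in (min_1..min_d, max_1..max_d) layout.

    Works for 2D boxes (4 numbers) and 3D boxes (6 numbers) alike.
    Assumed metric for illustration; not the thesis's stated criterion.
    """
    d = len(a) // 2
    inter, vol_a, vol_b = 1.0, 1.0, 1.0
    for i in range(d):
        lo = max(a[i], b[i])
        hi = min(a[i + d], b[i + d])
        inter *= max(0.0, hi - lo)      # zero if the boxes do not overlap on this axis
        vol_a *= a[i + d] - a[i]
        vol_b *= b[i + d] - b[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


# 2D: predicted vs. ground-truth bounding box as (x_min, y_min, x_max, y_max)
iou_2d = box_iou((0, 0, 2, 2), (1, 1, 3, 3))

# 3D: boxes given by position (center) and size, as in the task description
pred_3d = center_size_to_corners((1, 1, 1), (2, 2, 2))
gt_3d = center_size_to_corners((2, 2, 2), (2, 2, 2))
iou_3d = box_iou(pred_3d, gt_3d)
```

Under this assumption, a prediction would count as correct when its IoU with the ground-truth box exceeds a chosen threshold; the threshold and any 3D-specific tolerance would have to come from the thesis itself.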

Keywords/Search Tags:

human-object interaction, video understanding, 3-d reconstruction, action recognition, multi-modal learning

Related items

1	Behavior Recognition Methods In Complex Scenarios
2	Human Action Recognition Algorithm Based On Multi-modal
3	Research On Key Technology Of Action Recognition Based On Visual Perception
4	Research On Human Action Recognition For Multi-modal Human And Robot Interaction
5	Research On Human Action Recognition Based On Multi-modal Video
6	General Interacting Object Detection Algorithms For Action Understanding
7	Multi-modal Human Action Recognition
8	Analyzing And Understanding Human Actions In Videos
9	The Method Of Human-object Interaction Action Recognition
10	The Research And Application Of Human Action Recognition Based On The Mining Of Potential Association Of Multi-Modality Features