| In recent years, Human-Object Interaction (HOI) detection has attracted growing attention. Given an image or a video, HOI detection aims to localize human-object pairs and recognize the interactions between them, so the task plays an important role in scene understanding and in real-world applications such as anomaly detection in surveillance videos. This thesis focuses on the key technologies of HOI detection, and the research covers the following three aspects: 1. For interaction-related feature extraction, an interaction-centric graph parsing network is proposed for HOI detection. Given an image, a multi-relation graph convolutional network models one human node as the central node and the other nodes as semantic nodes, which are generated by the proposed interaction-related feature construction module. Furthermore, a multi-IoU (Intersection over Union) random shift scheme is proposed to augment the training data and enhance the generalization ability of the network. 2. For the design of scene features, a Transformer-based multi-modal feature enhancement network is proposed. Specifically, a feature fusion module is constructed to generate different interaction features, and multi-modal scene descriptors are fused to strengthen the contextual expression of the interaction features. 3. To address the long-tailed distribution and noisy labels in the datasets, a model jointly supervised by cluster labels and real labels is proposed for long-tailed learning. In addition, to ensure that the model is not penalized too heavily for predicting labels that are correct but missing from the HOI dataset annotations, a loss function based on the uncertainty of the model predictions is constructed. The performance of the HOI detection model is progressively improved by optimizing the feature representation, the model structure, and the quality of the datasets. Extensive experimental results on the HICO-DET and V-COCO datasets demonstrate the effectiveness and generalization ability of the above algorithms. |
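
The abstract names a multi-IoU random shift scheme for augmenting training boxes but does not spell out its details. The following is only a minimal sketch of one plausible reading, not the thesis's actual implementation: each ground-truth box is randomly jittered until the shifted box keeps at least a given IoU with the original, and this is repeated for several IoU levels. The function names, the `iou_levels` values, the `max_ratio` jitter range, and the `(x1, y1, x2, y2)` box format are all assumptions made for illustration.

```python
import random


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def random_shift(box, min_iou, rng, max_ratio=0.3, max_tries=50):
    """Jitter each box edge at random; accept the first candidate whose IoU
    with the original box is at least min_iou (hypothetical acceptance rule).
    Falls back to the unchanged box if no candidate qualifies."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    for _ in range(max_tries):
        cand = (
            x1 + rng.uniform(-max_ratio, max_ratio) * w,
            y1 + rng.uniform(-max_ratio, max_ratio) * h,
            x2 + rng.uniform(-max_ratio, max_ratio) * w,
            y2 + rng.uniform(-max_ratio, max_ratio) * h,
        )
        valid = cand[2] > cand[0] and cand[3] > cand[1]
        if valid and iou(box, cand) >= min_iou:
            return cand
    return box


def multi_iou_augment(box, iou_levels=(0.5, 0.7, 0.9), seed=0):
    """Produce one shifted copy of the box per IoU level (assumed levels)."""
    rng = random.Random(seed)
    return [random_shift(box, level, rng) for level in iou_levels]


if __name__ == "__main__":
    original = (10.0, 20.0, 110.0, 220.0)
    for level, shifted in zip((0.5, 0.7, 0.9), multi_iou_augment(original)):
        print(level, [round(v, 1) for v in shifted], round(iou(original, shifted), 3))
```

Sampling at several IoU levels exposes the network to both loosely and tightly localized boxes, which is one way such a scheme could improve robustness to detector localization errors.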