
Description Generation Method Based On Dynamic Scene Graph

Posted on: 2024-08-26    Degree: Master    Type: Thesis
Country: China    Candidate: J W Liu    Full Text: PDF
GTID: 2568307079470814    Subject: Electronic information
Abstract/Summary:
With the rapid advancement of deep learning and computer vision, image recognition and detection networks based on deep convolutional neural networks have achieved remarkable results. The deep features extracted by these networks can adequately capture instance information in an image. However, merely detecting and recognizing objects in images is not enough for many applications; models are also expected to infer deeper semantic relations between objects. This motivates the task of scene graph generation, which converts input visual information into a semantic graph structure that supports downstream reasoning tasks.

Visual features can be processed for scene graph generation in several ways: extracting global features with convolutional neural networks, extracting regional features with object detection models, and producing graph-structured representations with scene graph generation models. Compared with global and regional features, scene graph representations express more fine-grained information, including the relations and attributes among instances. However, several problems degrade the quality of generated scene graphs and the performance of downstream reasoning, such as the long-tail distribution of the dataset, which leaves the generated scene graphs short of informative relation predicates, and the incompleteness of scene graph annotations.

This thesis focuses on the intersection of video data and scene graph generation, and aims to improve the quality of visual descriptions by alleviating the long-tail problem of the dataset and providing fine-grained visual information through auxiliary supervision tasks. It proposes a dynamic scene graph generation framework based on multi-task learning. Inspired by the way different tasks in multi-task learning jointly learn the semantics of a shared modality, the framework enables the scene graph generation task to better extract local semantic details that previous works may overlook and to generate more fine-grained semantic information. For video scene graph datasets, a global feature module and a hybrid attention mechanism module are proposed to encode local spatial information and scene action information, ensuring information consistency between video frames; this helps mitigate the long-tail problem of dynamic scene graphs and improves overall predicate prediction accuracy. Experiments show that, compared with current state-of-the-art methods in dynamic scene graph generation, the proposed method with multi-task learning and a hybrid attention mechanism significantly improves the quality of video scene graph generation and better extracts high-value relation predicates in videos.
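The abstract gives no implementation details, but the multi-task setup it describes can be sketched as a weighted combination of the scene graph (predicate) loss with an auxiliary supervision loss computed over shared features, so that gradients from both tasks shape the same visual encoder. The auxiliary task, loss functions, and weight below are illustrative assumptions, not the thesis's actual design.

import torch
import torch.nn as nn

predicate_loss_fn = nn.CrossEntropyLoss()
auxiliary_loss_fn = nn.CrossEntropyLoss()  # e.g. a hypothetical per-frame action-label task

def multitask_loss(predicate_logits, predicate_labels,
                   aux_logits, aux_labels, aux_weight: float = 0.5):
    # Both terms backpropagate into the shared feature extractor, so the
    # local details needed by the auxiliary task also inform predicates.
    loss_sg = predicate_loss_fn(predicate_logits, predicate_labels)
    loss_aux = auxiliary_loss_fn(aux_logits, aux_labels)
    return loss_sg + aux_weight * loss_aux

# Usage sketch with made-up shapes: 4 relation instances, 26 predicate
# classes, and a 10-class auxiliary task.
logits_sg = torch.randn(4, 26)
labels_sg = torch.randint(0, 26, (4,))
logits_aux = torch.randn(4, 10)
labels_aux = torch.randint(0, 10, (4,))
loss = multitask_loss(logits_sg, labels_sg, logits_aux, labels_aux)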
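Similarly, the hybrid attention mechanism module can be read as spatial attention within each frame (local spatial information) combined with temporal attention across frames (scene action information), applied to per-object features before predicate classification. The following is a minimal PyTorch interpretation under those assumptions; the module structure, dimensions, and predicate head are all hypothetical.

import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_predicates: int = 26):
        super().__init__()
        # Spatial attention: object slots attend to each other within a frame.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal attention: each object slot attends across frames,
        # encouraging consistent relation predictions between frames.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Hypothetical predicate head over subject-object feature pairs.
        self.predicate_head = nn.Linear(2 * dim, num_predicates)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T frames, N object slots, dim) per-object features.
        T, N, D = feats.shape
        # Spatial attention within each frame (frames act as the batch).
        x, _ = self.spatial_attn(feats, feats, feats)
        x = self.norm1(feats + x)
        # Temporal attention across frames (object slots act as the batch).
        x_t = x.transpose(0, 1)                      # (N, T, D)
        y, _ = self.temporal_attn(x_t, x_t, x_t)
        x = self.norm2(x_t + y).transpose(0, 1)      # back to (T, N, D)
        # Score predicates for every ordered subject-object pair per frame.
        subj = x.unsqueeze(2).expand(T, N, N, D)
        obj = x.unsqueeze(1).expand(T, N, N, D)
        return self.predicate_head(torch.cat([subj, obj], dim=-1))  # (T, N, N, P)

# Usage sketch: 8 frames, 5 detected objects, 512-d features.
model = HybridAttention()
logits = model(torch.randn(8, 5, 512))
print(logits.shape)  # torch.Size([8, 5, 5, 26])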
Keywords/Search Tags: Scene Graph Generation, Cross-modal Learning, Multi-task Learning, Long-tail Problem, Attention Mechanism