Visual-language understanding in multimodal learning is a key area of the new generation of artificial intelligence, crucial for advancing China's artificial intelligence development strategy, growing the digital economy, and enhancing national competitiveness in the information age. Its core task is to simulate the human ability to comprehensively analyze multimodal data and make collaborative decisions, establishing informational connections between vision and language and bridging the semantic gap across modalities. Multimodal models such as Flamingo, CLIP, and GPT-4V have achieved significant results on scene-level visual-language understanding tasks. However, they mainly provide coarse-grained scene descriptions based on prominent entities and have not yet achieved fine-grained understanding of visual scenes, which is essential for multimodal agents to understand complex and varied real-world scenes. This research therefore focuses on the task of visual scene graph generation, exploring how to provide refined, scene-level, fine-grained visual-language representations for multimodal models, in support of complex visual reasoning tasks that require detailed and comprehensive visual-semantic information.

Specifically, this study represents a visual scene as a scene graph that depicts the entities in the scene and their interactions. Nodes carry the category, location, and attribute information of entities, while interactions between entities are abstracted as relational predicates and connected as edges between subjects and objects, yielding a graph representation whose smallest unit is the "subject-predicate-object" triple (a minimal sketch of this representation follows below).

Because visual relationships overlap semantically and follow a long-tail distribution, existing scene graph models are severely biased when predicting entity relationships. The main problems are as follows: first, models struggle to distinguish semantically similar predicates and thus to describe visual scenes accurately; second, scene graph models overfit on uncommon semantic predicates and converge inefficiently on them; third, models lack compositional generalization, i.e., the ability to predict visual relationships accurately under uncommon subject-object combinations. These problems are rooted in existing methods' failure to fully account for the complex, dynamic, and imbalanced semantic information in visual relationships. First, selecting the predicate that accurately depicts a visual scene depends on semantic expression in complex contexts: how similar two predicates are, and hence how hard they are to distinguish, changes with the subject-object context. Furthermore, predicate associativity reflects the model's ability to discriminate between predicates and its current learning state, and it changes dynamically during training. Finally, relational predicates are polysemous: the same predicate may express multiple meanings in different contexts.

The main objective of this dissertation is therefore to induce and understand the semantic information of visual relationships under imbalanced relationship distributions, complex and diverse subject-object combinations of predicates, and semantic overlap, so that scene graph models can adapt during learning and achieve balanced, efficient, and accurate visual relationship recognition.
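To make the representation concrete, the following is a minimal Python sketch of a scene graph as described above. All class and field names are illustrative assumptions, not the dissertation's actual data structures.

```python
# A minimal sketch of the scene-graph representation: nodes carry entity
# category, location, and attributes; edges are relational predicates.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Entity:
    """A node: one entity's category, location, and attributes."""
    category: str                                         # e.g. "person"
    bbox: Tuple[int, int, int, int]                       # (x1, y1, x2, y2)
    attributes: List[str] = field(default_factory=list)   # e.g. ["standing"]

@dataclass
class Relation:
    """An edge: a subject-predicate-object triple over entity indices."""
    subject: int     # index of the subject entity
    predicate: str   # relational predicate, e.g. "riding"
    obj: int         # index of the object entity

@dataclass
class SceneGraph:
    entities: List[Entity]
    relations: List[Relation]

    def triples(self):
        """Enumerate the graph as "subject-predicate-object" triples."""
        for r in self.relations:
            yield (self.entities[r.subject].category,
                   r.predicate,
                   self.entities[r.obj].category)

# Example scene: "a person riding a horse"
g = SceneGraph(
    entities=[Entity("person", (10, 20, 60, 120)),
              Entity("horse", (40, 60, 200, 180))],
    relations=[Relation(subject=0, predicate="riding", obj=1)],
)
print(list(g.triples()))  # [('person', 'riding', 'horse')]
```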
This dissertation proposes solutions to three key issues in visual relationship generation.

(1) How can the associativity of visual relationships be used to enhance the model's fine-grained recognition of semantically similar predicates? This dissertation presents a method that analyzes predicate associativity in complex contexts to guide the model toward more accurate recognition of fine-grained predicates in scene graphs. Owing to imbalanced data distributions and semantic overlap, existing scene graph methods typically favor common (head) predicates within the same semantic scene, while distribution-balanced methods favor uncommon (tail) predicates; both therefore struggle to recognize semantically similar predicates at a fine granularity. To address this, the dissertation proposes the task of Fine-Grained Predicate Recognition. By analyzing how subject-object pairs affect predicate associativity, a Predicate Correlation Matrix (PCM) is constructed to fully represent the associative information of predicates. Taking both the distribution and the associativity of predicates into account, inter-class regularization is applied while learning fine-grained predicates, enabling the model to distinguish easily confused predicates and ultimately generate fine-grained scene graphs.

(2) How can the model capture dynamically changing fine-grained semantic information for balanced and efficient recognition? This dissertation proposes an adaptive dynamic predicate learning strategy that monitors the model's learning state and adaptively updates the predicate associativity information, gradually refining the model's discrimination and keeping the learning process balanced and efficient. Existing scene graph de-biasing methods can partially alleviate biased predictions for semantically similar predicates; however, because the inductive-bias information they use is inconsistent with the model's learning state, they converge slowly and over-correct, which further hinders balanced and efficient recognition across predicate classes. This dissertation therefore builds an Adaptive Predicate Correlation Matrix (PCM-A) that updates the relational predicate associations according to the model's learning state, and designs an Adaptive Fine-grained Discriminating Loss that exploits the fine-grained associativity between predicates to adaptively adjust the learning process, achieving efficient and balanced recognition of fine-grained predicates (a sketch of this mechanism follows below).
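The following is a minimal PyTorch sketch of the general mechanism behind PCM-A and the adaptive loss: a predicate-predicate correlation estimate is refreshed from the model's current confusion (its learning state) and used to penalize probability mass placed on predicates that are easily confused with the target. The class name, the momentum update, and the penalty form are illustrative assumptions; the dissertation's actual PCM construction also conditions on subject-object context.

```python
import torch
import torch.nn.functional as F

NUM_PRED = 50  # number of predicate classes (e.g., 50 in Visual Genome)

class AdaptivePredicateCorrelation:
    """Illustrative stand-in for PCM-A plus a correlation-aware loss."""

    def __init__(self, num_pred=NUM_PRED, momentum=0.9):
        self.pcm = torch.eye(num_pred)  # start with no cross-class correlation
        self.momentum = momentum

    @torch.no_grad()
    def update(self, logits, labels):
        """Refresh the correlation estimate from the model's current
        predictions: row c tracks the average predicted distribution over
        all predicates for samples whose ground-truth predicate is c."""
        probs = F.softmax(logits, dim=-1)
        for c in labels.unique():
            mean_conf = probs[labels == c].mean(dim=0).cpu()
            self.pcm[c] = self.momentum * self.pcm[c] + (1 - self.momentum) * mean_conf

    def discriminating_loss(self, logits, labels):
        """Cross-entropy plus a penalty on confidence assigned to predicates
        currently correlated with (i.e., confused for) the target class."""
        ce = F.cross_entropy(logits, labels)
        probs = F.softmax(logits, dim=-1)
        corr = self.pcm.to(logits.device)[labels]          # per-sample correlation row
        corr = corr.scatter(1, labels.unsqueeze(1), 0.0)   # ignore the target itself
        penalty = (corr * probs).sum(dim=1).mean()         # push confusable mass down
        return ce + penalty

# Usage inside a training step (logits: [B, NUM_PRED], labels: [B]):
apc = AdaptivePredicateCorrelation()
logits = torch.randn(8, NUM_PRED)
labels = torch.randint(0, NUM_PRED, (8,))
loss = apc.discriminating_loss(logits, labels)
apc.update(logits, labels)  # adapt the correlations to the new learning state
```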
(3) How can learning of the multiple semantic concepts within one predicate category be balanced? This dissertation proposes a multi-concept learning framework that ensures the model learns rare, general, and common concepts in a balanced way. Compared with unbiased prediction between predicate categories, balanced learning within a category is both more valuable and more challenging: because the contexts within a category are diverse and imbalanced, the model struggles to learn a predicate's semantics in uncommon contexts. The dissertation therefore first proposes the Generalized Unbiased Scene Graph Generation (G-USGG) task, which requires the model to pursue balanced learning and unbiased prediction both between and within categories. Specifically, by quantifying the semantic imbalance among predicates, the semantic concepts within each predicate category are represented as multiple Concept-Prototypes. Concept Regularization (CR) and a Balanced Prototypical Memory (BPM) module are then designed to represent the multiple semantics within a category effectively and in a balanced manner (a minimal sketch of this prototype-based representation closes this section).

Finally, this dissertation summarizes the above research findings and offers an outlook on research directions that may significantly shape the future development of visual scene graph generation.
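As an illustration of the multi-concept idea in contribution (3), the following is a minimal PyTorch sketch in which each predicate category owns K concept prototypes that are matched by nearest-neighbor assignment, momentum-updated, and used as a regularization target. The class, the value of K, and the update rule are illustrative assumptions rather than the dissertation's CR/BPM formulation.

```python
import torch
import torch.nn.functional as F

class ConceptPrototypes:
    """K concept prototypes per predicate category (e.g., rare / general /
    common concepts), kept on the unit sphere."""

    def __init__(self, num_pred, k, dim, momentum=0.99):
        self.protos = F.normalize(torch.randn(num_pred, k, dim), dim=-1)
        self.momentum = momentum

    def assign(self, feats, labels):
        """Match each relation feature to the nearest prototype of its
        ground-truth predicate category."""
        feats = F.normalize(feats, dim=-1)
        sims = torch.einsum('bd,bkd->bk', feats, self.protos[labels])
        return sims.argmax(dim=1)  # concept index within the category

    @torch.no_grad()
    def update(self, feats, labels, concepts):
        """Momentum-update only the matched prototype, so rare concepts keep
        a representation of their own instead of being averaged away."""
        feats = F.normalize(feats, dim=-1)
        for f, c, k in zip(feats, labels, concepts):
            p = self.momentum * self.protos[c, k] + (1 - self.momentum) * f
            self.protos[c, k] = F.normalize(p, dim=0)

    def concept_regularization(self, feats, labels, concepts):
        """Pull each feature toward its matched concept prototype (a simple
        stand-in for a concept-level regularization term)."""
        feats = F.normalize(feats, dim=-1)
        matched = self.protos[labels, concepts]
        return (1 - (feats * matched).sum(dim=-1)).mean()

# Usage: 3 concepts per predicate over 50 predicate classes, 256-d features.
cp = ConceptPrototypes(num_pred=50, k=3, dim=256)
feats = torch.randn(8, 256)
labels = torch.randint(0, 50, (8,))
concepts = cp.assign(feats, labels)
reg = cp.concept_regularization(feats, labels, concepts)
cp.update(feats, labels, concepts)
```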