Research On Visual Dialog Based On Inference Reinforcement

Posted on: 2024-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Zhang
GTID: 2568306941963639
Subject: Computer technology
Abstract/Summary:
Visual Dialog is one of the most important tasks at the intersection of computer vision and natural language processing. The task is to carry out multiple rounds of dialog about a given image, which requires AI agents to have strong cognition and interactive reasoning capabilities over both visual and textual features. Most existing methods fail to obtain accurate reasoning results in complex task scenarios because their reasoning ability is insufficient. This dissertation tackles the reasoning problem in Visual Dialog from three aspects: semantic reasoning, knowledge reasoning and zero-shot reasoning. The main research contents are as follows:

(1) To solve the problem of inaccurate target mapping caused by the weak inductive bias of the attention mechanism between the visual and text modalities, this dissertation proposes a method that couples the attention mechanism with convolution to enhance the semantic reasoning ability of the model. The method relies mainly on the strong inductive bias of the convolutional network. Built on top of the attention model, it preserves the model's capacity for processing large-scale data while strengthening its inductive ability with respect to prior knowledge, which helps to extract and infer key modal features. Experimental results show that the method makes the model's semantic understanding more accurate.

(2) To solve the problem of inaccurate knowledge understanding caused by insufficient reasoning interaction between external knowledge and internal multi-modal data in visual dialog, this dissertation proposes a knowledge-aware causal inference network. First, external commonsense knowledge is generated according to the entities extracted from the question. Then the spurious causal relationship between the dialog history and the answer is cut off, and confounding factors among features are eliminated using the external knowledge. Finally, the external commonsense knowledge is fused with the internal language and visual features through multi-level feature fusion to obtain the final answer. The experiments demonstrate the necessity of introducing commonsense knowledge and the ineffectiveness of commonsense knowledge extracted from the dialog history. In addition, comparisons with a knowledge-unaware framework and a graph-based knowledge-aware framework on the VisDial v1.0 dataset show the superiority of the proposed causality-based framework, which achieves more accurate knowledge reasoning.

(3) To solve the problem that the model, trained on a limited dataset, lacks the ability to infer unknown features and therefore generalizes poorly, this dissertation proposes a zero-shot reasoning network based on relation regularization. The method uses a visual relation regularized module to find known visual features that are semantically close to the unknown visual features, and uses these known features to help understand the unknown ones. Similarly, for an unknown answer, this dissertation uses a text relation regularized module to strengthen the keywords in the answer, and strengthens the understanding of the unknown answer through the relationships between the keywords. Experiments on two zero-shot datasets, proposed here for the first time, and on the standard test set show that the modules improve the understanding and reasoning ability of the models and bring them to a competitive level.

Taken together, this dissertation analyzes internal semantic interactive reasoning enhancement, causality-based external knowledge reasoning enhancement, and zero-shot reasoning ability in visual dialog from different perspectives. Comparative experiments, ablation experiments, and visualization results demonstrate the effectiveness of the methods, which improve the reasoning ability of visual dialog models.
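As an illustration of contribution (1), the following is a minimal sketch of one way an attention branch and a convolution branch could be coupled. It is not the dissertation's architecture, which the abstract does not specify. The function names, the identity projections in the attention branch, the fixed smoothing kernel, and the mixing weight `alpha` are all assumptions made for this example. The attention branch mixes every token with every other token (global context), while the convolution branch mixes each token only with its neighbours (a locality prior).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output row is a weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def local_conv(tokens, kernel):
    """Depthwise 1-D sliding-window filter over the sequence axis, zero padded.

    The kernel is assumed to have odd length so it can be centred on each token.
    """
    n, d = len(tokens), len(tokens[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for offset, w in enumerate(kernel):
            idx = i + offset - half
            if 0 <= idx < n:  # positions outside the sequence contribute zero
                for j in range(d):
                    row[j] += w * tokens[idx][j]
        out.append(row)
    return out


def attention_conv_block(tokens, kernel, alpha=0.5):
    """Hypothetical coupling: convex combination of the two branches."""
    attn = self_attention(tokens)
    conv = local_conv(tokens, kernel)
    return [
        [alpha * a + (1.0 - alpha) * c for a, c in zip(attn_row, conv_row)]
        for attn_row, conv_row in zip(attn, conv)
    ]


# Three toy "tokens" with two features each (hypothetical data).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = attention_conv_block(tokens, kernel=[0.25, 0.5, 0.25])
```

Setting `alpha=1.0` recovers the pure attention branch and `alpha=0.0` the pure convolution branch, so the parameter makes the trade-off between global context and local inductive bias explicit. In a trained model the kernel and the projections would be learned rather than fixed.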
Keywords/Search Tags: Visual Dialog, Commonsense Knowledge, Zero-shot Learning, Attention Mechanism, Convolution