
Research And Application Of Visual Question Answering And Explanation Generation

Posted on: 2023-10-26    Degree: Master    Type: Thesis
Country: China    Candidate: Y X Qian    Full Text: PDF
GTID: 2568306914471684    Subject: Intelligent Science and Technology
Abstract/Summary:
Visual Question Answering (VQA) is a cross-disciplinary research task spanning computer vision and natural language processing, and it is one of the most representative tasks in multimodal learning. VQA requires a machine to answer natural language questions about an image, while the explanation generation task aims to produce explanations for the given answers, which effectively improves the interpretability of the model.

This thesis analyzes the deficiencies of existing methods for both VQA and explanation generation. Current VQA models lack modeling and utilization of spatial location relationships; the use of different categories of relational information lacks effective coordination and guidance; and noise and redundant information are not effectively eliminated. Meanwhile, existing generated explanations fall short in readability and coherence, and previous methods perform explanation generation and VQA independently, which weakens the value of the explanation generation task. This thesis therefore carries out work in the following two directions.

To address the issues above, a question-driven graph fusion visual question answering model (QD-GFN) is designed. First, the model proposes a novel method to make more effective use of the fine-grained spatial information contained in the image. In addition, it uses the global representation of the question to guide the fusion of different graph attention networks, so as to effectively coordinate the roles of different categories of relations. Finally, it adopts an object filtering mechanism based on an object importance coefficient to remove image objects that have low correlation with the question. The model achieves competitive performance on the VQA 2.0 dataset compared with other relation-aware methods, and it further achieves state-of-the-art performance on the VQA-CP v2 dataset, which is commonly used to test a model's generalization ability. A series of ablation experiments demonstrates that each module contributes positively to the model's performance. A minimal sketch of the two key mechanisms is given below.

Building on QD-GFN, an explanation generation module is added to form a model that answers the question and generates an explanation at the same time (E-VQA). The model adopts the idea of multi-task learning to reduce the coupling between the two tasks. On the open-ended explanation generation dataset VQA-X, it surpasses previous methods on the BLEU-4, METEOR, ROUGE, SPICE, and CIDEr metrics. Further experiments show that explanation generation helps the model achieve better VQA performance. Finally, this thesis builds a demonstration system for E-VQA: given the images and questions submitted by users, the system returns answers and explanations and visually displays the attended regions of the image.
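The sketch below illustrates, in PyTorch, the two mechanisms named in the abstract: question-guided fusion of several relation-specific graph attention branches, and object filtering by an importance coefficient. It is a minimal illustration under assumptions: all module names, tensor shapes, and the use of standard multi-head self-attention as a stand-in for the thesis's graph attention networks are hypothetical and do not reproduce the author's QD-GFN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedGraphFusion(nn.Module):
    """Illustrative sketch (not the thesis code) of QD-GFN's two ideas:
    1) a global question vector weights the outputs of several relation-specific
       graph attention branches (e.g. spatial vs. semantic relations);
    2) an object importance coefficient filters out objects weakly related
       to the question.
    Assumed shapes: object features v [B, N, D], question vector q [B, D].
    """

    def __init__(self, dim: int, num_branches: int = 2, keep_ratio: float = 0.7):
        super().__init__()
        # stand-in for relation-specific graph attention layers
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_branches)
        )
        self.gate = nn.Linear(dim, num_branches)   # question -> branch weights
        self.importance = nn.Linear(dim, 1)        # object importance coefficient
        self.keep_ratio = keep_ratio

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        # 1) run each relation branch over the object graph
        branch_outs = [attn(v, v, v)[0] for attn in self.branches]   # each [B, N, D]

        # 2) question-driven fusion: softmax weights over the branches
        w = F.softmax(self.gate(q), dim=-1)                          # [B, num_branches]
        fused = sum(w[:, i:i + 1, None] * out for i, out in enumerate(branch_outs))

        # 3) object filtering: score each object against the question,
        #    keep only the top-k most relevant objects
        scores = self.importance(fused * q.unsqueeze(1)).squeeze(-1)  # [B, N]
        k = max(1, int(self.keep_ratio * v.size(1)))
        topk = scores.topk(k, dim=1).indices                          # [B, k]
        kept = torch.gather(fused, 1, topk.unsqueeze(-1).expand(-1, -1, v.size(-1)))
        return kept, scores
```

For the E-VQA stage, the multi-task setup described in the abstract would then combine an answer-classification loss with a token-level loss from the explanation decoder (e.g. `loss = answer_ce + lambda_expl * explanation_nll`); the weighting and decoder architecture are not specified in the abstract and are left open here.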
Keywords/Search Tags:Visual Question Answering, Explanation Generation, Visual Relation Modeling, Graph Attention Network