
Research On Visual Question Answering Based On Knowledge Graph And Answer Space Optimization

Posted on: 2024-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Jiang
Full Text: PDF
GTID: 2568307130453014
Subject: Software engineering
Abstract/Summary:
As computer vision and natural language processing continue to develop within artificial intelligence, visual question answering (VQA) has emerged as an interdisciplinary field between the two. VQA is the task of answering a natural language question about a given image. However, if a VQA model uses only the information in the given image and text, it struggles to answer questions that require external knowledge, and it tends to depend on question biases in the training dataset. Based on in-depth research and analysis of existing VQA methods, this thesis proposes a VQA method based on a knowledge graph and answer space optimization, and designs and implements a VQA prototype system. The main work of this thesis includes:

1. A visual question answering method that embeds knowledge graph features into the image and text representations is proposed. This method explores the external knowledge in the input image more fully than traditional knowledge-based visual question answering methods. To process external knowledge related to the image, the proposed external knowledge embedding method (KEVR) embeds entity nodes in the knowledge graph, as external knowledge features, into the image feature representation. For external knowledge related to the text, a specially designed Transformer block embeds external knowledge features into the text feature representation. Finally, the model aggregates the features from the different modalities using a feature aggregator for answer classification. Experimental results show that the proposed method leads the best baseline models in accuracy by 0.66% and 1.61% on two different datasets.

2. An answer space optimization method based on external knowledge and semantic loss is proposed. It outputs answers through feature matching, so that the model gives semantically similar answers for similar images and similar questions. This allows the model to answer questions from what it has learned, rather than relying on biases in the dataset. Meanwhile, external knowledge features from the knowledge graph are embedded into the answer feature representation to improve the accuracy of feature matching. Additionally, a semantic loss mechanism is introduced into the feature matching method to impose different penalties on different types of answers, reducing the model’s dependence on questions during training and alleviating bias in visual question answering. Comparative and ablation experiments demonstrate the effectiveness of the proposed method, which leads the best baseline models in accuracy by 1.25%, 2.14%, and 11.11% on three different datasets.

3. Based on the two methods proposed above, a VQA system is designed and implemented. The system consists mainly of six modules: data entry, knowledge graph preprocessing, joint feature generation, answer feature processing, answer output, and result storage. The system is relatively easy to use and effective, and it is applicable to a variety of VQA scenarios, with high application value and good prospects.
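
The feature-matching step described in the second contribution can be illustrated with a minimal sketch. It assumes cosine similarity as the matching score and uses made-up three-dimensional vectors. The abstract does not specify the similarity measure, the feature dimensions, or the form of the semantic loss, so the function names `cosine` and `match_answer` and all of the toy data are hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_answer(joint_feature, answer_features):
    """Return the candidate answer whose feature vector is most similar to joint_feature."""
    return max(
        answer_features,
        key=lambda answer: cosine(joint_feature, answer_features[answer]),
    )


# Hypothetical answer features; in a real system these would come from an
# answer encoder, possibly enriched with knowledge graph features.
answer_features = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "bus": [0.0, 0.0, 1.0],
}

# Hypothetical joint image-question feature produced by the VQA model.
joint_feature = [0.8, 0.3, 0.0]

best = match_answer(joint_feature, answer_features)
print(best)
```

In this toy setup the semantically related answers "cat" and "kitten" both score close to the joint feature while "bus" scores zero, which is the property that matching in a shared feature space is meant to provide: similar inputs land near semantically similar answers, instead of each answer being an unrelated class index.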
Keywords/Search Tags: visual question answering, knowledge graph, graph convolution, multi-modal fusion, feature matching