
Research On Visual Question Answering Based On Graph Neural Networks And Attention Mechanisms

Posted on: 2024-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: C Liu  Full Text: PDF
GTID: 2568307094979779  Subject: Applied Statistics
Abstract/Summary:
With the rapid development of the Internet, massive multimedia data are growing and accumulating rapidly. Multimodal learning tasks that span modalities diverse in form yet semantically related, such as images, text, video, and audio, have attracted widespread attention. Visual Question Answering (VQA), a multimodal learning task that aims to enable computers to answer questions based on image content, is a fascinating research direction in artificial intelligence, with high practical value in scenarios such as surveillance and chatbot conversation; the VQA task is therefore well worth studying. In this thesis, we explore the interaction between vision and language in the VQA task using graph neural networks and attention mechanisms, with the goal of improving the computer's ability to answer questions based on image content. The specific work is as follows:

(1) To address two shortcomings of the traditional visual attention model BUTD, namely its lack of reasoning over relationships between visual objects and its neglect of the dense semantic interaction between the image and the question text, a VQA model based on a spatial graph convolutional network and a co-attention network is proposed. The model employs binary relational reasoning as a graph-learner module to learn a graph-structured representation that captures relationships between visual objects, and learns a question-specific image representation through spatial graph convolution layers. The image representations and question-word features are then passed through a deep co-attention module for co-attention learning. Finally, the learned question-word features and visual features are fed into a multimodal fusion and answer prediction module, which uses the logistic function from statistics to score 3,129 candidate answers. (Illustrative code sketches of these components follow the abstract.)

(2) Building on the first model, an improved model based on a gated graph convolutional network and a bidirectionally guided co-attention network is proposed, to explore how explicit spatial relationships in images and symmetric semantic interactions between images and questions affect model performance. First, the model constructs a spatial relationship graph from the relative spatial positions of the visual objects in an image. Second, a gated graph convolutional network dynamically controls how much each neighbour contributes to a node's representation. The question-word features and the spatially aware visual features are then fed into a bidirectionally guided co-attention module to jointly learn the dense semantic interactions between them. Finally, the learned features are fused across modalities and the answers are predicted by a classifier. (A sketch of the gated aggregation is also given below.)

(3) The two proposed models are trained and evaluated on the VQA v2.0 dataset. Experimental results show that the first model achieves an Overall accuracy of 68.12% on the test-std set, 2.45% higher than that of the traditional visual attention model BUTD, and that the second model achieves an Overall accuracy of 71.04% on the test-std set, further strengthening the reasoning ability relative to the first model.
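To make contribution (1) concrete, the following is a minimal PyTorch sketch of a pairwise graph learner and a single graph convolution layer over the learned adjacency. All class names, layer sizes, and the softmax normalisation of the adjacency are illustrative assumptions based on the abstract, not the thesis's actual implementation.

```python
# Hypothetical sketch of the graph-learner + graph-convolution step in (1).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLearner(nn.Module):
    """Binary relational reasoning: score every ordered pair of visual
    objects to produce a soft adjacency matrix over the image regions."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, v):                       # v: (B, N, dim) region features
        B, N, D = v.shape
        vi = v.unsqueeze(2).expand(B, N, N, D)  # receiving object i
        vj = v.unsqueeze(1).expand(B, N, N, D)  # candidate neighbour j
        scores = self.pair_mlp(torch.cat([vi, vj], dim=-1)).squeeze(-1)
        return F.softmax(scores, dim=-1)        # (B, N, N) row-normalised graph

class SpatialGraphConv(nn.Module):
    """One graph-convolution layer: aggregate neighbour features under the
    learned adjacency, then transform and add a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, v, adj):                  # adj: (B, N, N)
        return F.relu(v + self.proj(torch.bmm(adj, v)))
```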
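The deep co-attention of model (1) and the bidirectionally guided co-attention of model (2) both model dense word-region interaction. One plausible symmetric formulation, sketched here with standard multi-head cross-attention (the specific guiding scheme used in the thesis may differ), is:

```python
import torch.nn as nn

class BiGuidedCoAttention(nn.Module):
    """Symmetric cross-attention: question words attend to image regions
    and regions attend to words, modelling dense word-region interaction."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q, v):
        # q: (B, T, dim) question-word features; v: (B, N, dim) region features
        v_new, _ = self.q2v(v, q, q)   # regions attend over question words
        q_new, _ = self.v2q(q, v, v)   # words attend over image regions
        return q + q_new, v + v_new    # residual updates for both streams
```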
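Contribution (2) replaces the learned soft adjacency with an explicit spatial relationship graph and gates each neighbour's contribution. The per-edge sigmoid gate below is one plausible reading of "dynamically control the degree of contribution of different neighbours"; it is an assumption for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConv(nn.Module):
    """Gated graph convolution: a sigmoid gate per directed edge scales how
    much each spatial neighbour contributes to a node's updated feature."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)   # edge gate from the (i, j) pair
        self.proj = nn.Linear(dim, dim)

    def forward(self, v, adj):
        # v: (B, N, dim) features; adj: (B, N, N) 0/1 spatial-relation graph
        B, N, D = v.shape
        vi = v.unsqueeze(2).expand(B, N, N, D)
        vj = v.unsqueeze(1).expand(B, N, N, D)
        g = torch.sigmoid(self.gate(torch.cat([vi, vj], dim=-1))).squeeze(-1)
        g = g * adj                                     # mask non-edges
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        msg = torch.bmm(g, self.proj(v)) / deg          # gated neighbour mean
        return F.relu(v + msg)                          # residual update
```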
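The "logistic function" classifier over 3,129 candidate answers matches common VQA v2.0 practice: each answer is scored independently with a sigmoid and trained against soft ground-truth answer scores via binary cross-entropy. A hedged sketch, with the fused feature dimension left as a parameter:

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Score every candidate answer independently with a logistic (sigmoid)
    unit, the usual multi-label setup for VQA v2.0 soft answer scores."""
    def __init__(self, fused_dim, num_answers=3129):
        super().__init__()
        self.fc = nn.Linear(fused_dim, num_answers)

    def forward(self, fused):                  # fused: (B, fused_dim)
        return torch.sigmoid(self.fc(fused))   # (B, 3129) per-answer scores
```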
Keywords/Search Tags:Visual question answering, Spatial graph convolutional network, Spatial relational graph, Co-attention network, Gated graph convolutional network