In recent years, researchers have paid increasing attention to multimodal tasks that combine image vision and natural language processing, such as video captioning and visual question answering. Unlike traditional single-modal tasks that use only image visual features or only natural language features, visual question answering must attend to both inputs at the same time, reason about the textual question according to the content of the image, and then produce a reasonable answer. As a multimodal task spanning natural language processing and computer vision, visual question answering mainly comprises visual feature extraction, question feature extraction, feature fusion, and answer output modules. This paper studies visual question answering methods based on graph neural network models; the main work is as follows:

(1) To address the representation of spatial features in images, a graph neural network is used to improve the fusion of image feature representations and text features. In most previous methods, image features and text features are fused directly as they come out of the feature extractors, so high-level spatial information is missing, the available information is insufficient, and model performance is only average. In this paper, a graph neural network combined with a polar-coordinate function is used to model the spatial partitioning of image features and learn richer spatial association information; the graph-structured features and the text features are then re-fused through an optimized cascaded attention mechanism composed of multi-head attention units, which improves reasoning performance.

(2) To address the problem that multimodal features cannot effectively model one another, a visual question answering model combining an adaptive graph structure with an attention mechanism is proposed. The adaptive graph structure redefines the kernel-parameterized Laplacian matrix of spectral convolution, overcoming the limitation that spectral convolution kernels are computed only over fixed neighborhoods, and uses the Mahalanobis distance to measure distances between graph vertices. The effective BLOCK bilinear fusion technique is then incorporated into the attention mechanism. Compared with the shallow features that existing models rely on, this model can better capture the intrinsic relationships between the visual and linguistic modalities and answer questions through relational reasoning.
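To illustrate the first contribution, the following is a minimal PyTorch sketch of how detected region features could be organized into a polar-coordinate spatial graph and then fused with question features through cascaded multi-head attention. The sector and ring bin counts, the `PolarGraphLayer` and `CascadedAttentionFusion` names, and all dimensions are illustrative assumptions, not the thesis's exact design.

```python
# Sketch: polar-coordinate spatial graph over detected image regions,
# followed by cascaded multi-head attention fusion with question features.
import math
import torch
import torch.nn as nn

def polar_adjacency(centers, num_sectors=8, num_rings=3):
    """Assign every ordered pair of region centers a relation id based on
    which angle sector and distance ring the pair falls into.
    centers: (N, 2) box centers in [0, 1] image coordinates."""
    dx = centers[:, 0].unsqueeze(0) - centers[:, 0].unsqueeze(1)          # (N, N)
    dy = centers[:, 1].unsqueeze(0) - centers[:, 1].unsqueeze(1)
    angle = torch.atan2(dy, dx) + math.pi                                  # [0, 2*pi]
    dist = torch.sqrt(dx ** 2 + dy ** 2)                                   # [0, sqrt(2)]
    sector = torch.clamp((angle / (2 * math.pi) * num_sectors).long(), max=num_sectors - 1)
    ring = torch.clamp((dist / math.sqrt(2.0) * num_rings).long(), max=num_rings - 1)
    return sector * num_rings + ring                                       # (N, N) relation ids

class PolarGraphLayer(nn.Module):
    """One message-passing step where each polar relation has its own weight."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_weight = nn.Parameter(torch.randn(num_relations, dim, dim) * 0.02)

    def forward(self, x, relation):                        # x: (N, dim), relation: (N, N)
        transformed = torch.einsum('rdo,nd->rno', self.rel_weight, x)      # (R, N, dim)
        idx = torch.arange(x.size(0), device=x.device).unsqueeze(0)
        msgs = transformed[relation, idx]                  # msgs[i, j] = W_{rel(i,j)} x_j
        return torch.relu(x + msgs.mean(dim=1))            # residual aggregation

class CascadedAttentionFusion(nn.Module):
    """Question features attend to graph features through stacked multi-head units."""
    def __init__(self, dim, num_heads=8, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, question, regions):                  # (1, T, dim), (1, N, dim)
        fused = question
        for attn in self.layers:
            ctx, _ = attn(fused, regions, regions)         # cross-attention onto the graph
            fused = fused + ctx                            # residual cascade
        return fused.mean(dim=1)                           # pooled joint representation

# Toy usage: 36 regions with 512-d (already projected) features, a 14-token question.
torch.manual_seed(0)
centers = torch.rand(36, 2)
regions = torch.randn(36, 512)
question = torch.randn(1, 14, 512)
relation = polar_adjacency(centers)
regions = PolarGraphLayer(512, num_relations=8 * 3)(regions, relation)
joint = CascadedAttentionFusion(512)(question, regions.unsqueeze(0))
print(joint.shape)   # torch.Size([1, 512])
```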
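For the adaptive graph structure in the second contribution, the sketch below shows one way a learned Mahalanobis metric could define a data-dependent adjacency whose symmetrically normalized form drives a first-order spectral-style graph convolution. The Gaussian kernel, the `AdaptiveSpectralConv` name, and the layer sizes are assumptions for illustration, not the thesis's exact formulation.

```python
# Sketch: adaptive adjacency from a learned Mahalanobis metric, plugged into
# a first-order spectral-style graph convolution.
import torch
import torch.nn as nn

class AdaptiveSpectralConv(nn.Module):
    def __init__(self, in_dim, out_dim, metric_dim=64):
        super().__init__()
        # W_m parameterizes M = W_m^T W_m, a learned positive semi-definite metric,
        # so d(i, j)^2 = (x_i - x_j)^T M (x_i - x_j) is a Mahalanobis distance,
        # which equals the squared Euclidean distance in the projected space.
        self.metric = nn.Linear(in_dim, metric_dim, bias=False)
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):                                   # x: (N, in_dim)
        z = self.metric(x)                                  # (N, metric_dim)
        dist2 = torch.cdist(z, z, p=2) ** 2                 # squared Mahalanobis distances
        adj = torch.exp(-dist2)                             # dense, data-dependent adjacency
        deg = adj.sum(dim=1).clamp(min=1e-8)
        d_inv_sqrt = deg.pow(-0.5)
        # symmetrically normalized adjacency D^{-1/2} A D^{-1/2} = I - L_sym
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.weight(x))        # one propagation step

# Toy usage on 36 region features.
x = torch.randn(36, 512)
out = AdaptiveSpectralConv(512, 512)(x)
print(out.shape)   # torch.Size([36, 512])
```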
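The BLOCK-style bilinear fusion step in the second contribution could be approximated, in spirit, by projecting the two modality vectors, splitting them into chunks, and applying a small bilinear map per chunk. The sketch below is a simplified stand-in with illustrative chunk counts and dimensions, not the exact block-superdiagonal tensor decomposition used by BLOCK.

```python
# Sketch: chunk-wise bilinear fusion of a visual vector and a question vector.
import torch
import torch.nn as nn

class BlockBilinearFusion(nn.Module):
    def __init__(self, dim_v, dim_q, dim_out, num_chunks=4, chunk_rank=16):
        super().__init__()
        assert dim_out % num_chunks == 0
        self.num_chunks = num_chunks
        self.proj_v = nn.Linear(dim_v, num_chunks * chunk_rank)
        self.proj_q = nn.Linear(dim_q, num_chunks * chunk_rank)
        # one small bilinear map per chunk
        self.bilinear = nn.ModuleList(
            [nn.Bilinear(chunk_rank, chunk_rank, dim_out // num_chunks)
             for _ in range(num_chunks)]
        )

    def forward(self, v, q):                                # v: (B, dim_v), q: (B, dim_q)
        vs = self.proj_v(v).chunk(self.num_chunks, dim=-1)
        qs = self.proj_q(q).chunk(self.num_chunks, dim=-1)
        fused = [blk(vi, qi) for blk, vi, qi in zip(self.bilinear, vs, qs)]
        return torch.cat(fused, dim=-1)                     # (B, dim_out)

# Toy usage: fuse a pooled visual vector with a pooled question vector.
v = torch.randn(2, 512)
q = torch.randn(2, 512)
print(BlockBilinearFusion(512, 512, 256)(v, q).shape)       # torch.Size([2, 256])
```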