Research And Application Of Visual Question And Answering Algorithm Based On Deep Learning

Posted on: 2022-08-01  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Feng  Full Text: PDF
GTID: 2558306914462624  Subject: Electronic and communication engineering
Abstract/Summary:
Visual question answering (VQA) is a frontier of current artificial intelligence research that requires combining computer vision and natural language processing. This kind of cross-field task is closer to the way people obtain information, so it has broad application scenarios and high research value. VQA can be described as understanding the semantics of a given question and then searching a given image for the information needed to predict the answer. The classical VQA model includes four modules: image feature extraction, text feature extraction, modal feature fusion, and answer prediction. Based on an analysis of current VQA models, this paper optimizes the classical VQA model from three perspectives: using multiple attention mechanisms, focusing on the answer set, and improving feature extraction. The main research work and contributions of this paper are as follows.

1. To integrate cross-modal information effectively, multiple attention mechanisms are introduced. In addition to the multi-head attention in Transformer text encoders, this paper proposes target attention and spatial attention algorithms for images. The former identifies which areas of an image need to be focused on, and the latter captures positional associations between objects or regions within the image itself.

2. The study finds that existing VQA models modify only the first three of the four modules mentioned above. They focus on the questions and images while ignoring the role of the answer set in answer prediction. This paper proposes a two-step VQA network and a VQA network with embedded image text recognition. These address two problems: the answer set contains too many invalid and redundant answers, and the correct answer may not exist in the answer set at all.

3. The object detection algorithm Faster R-CNN is combined with the convolutional network ResNet-101 to process image information, so that foreground object features and background features complement each other. At the same time, the multi-layer Transformer encoder addresses the problem that the long short-term memory (LSTM) network encoder ignores word order, and it integrates the low-level word features used to locate regions with the high-level abstract features used to understand semantics.

Experimental results for the proposed optimization methods are presented on several data sets, including VQA v2.0, and the performance of the proposed methods is compared with that of other advanced VQA models, which demonstrates the advanced performance of the proposed methods.
Keywords/Search Tags:visual question answering, multiple attention mechanism, two-step network, text recognition
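
The four-module structure of the classical VQA model described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the random vectors stand in for real image region features (which the thesis obtains from Faster R-CNN and ResNet-101), the mean of word vectors stands in for a Transformer text encoder, and the element-wise product is only one common fusion choice. All function names, the dimension `DIM`, and the toy vocabulary and answer set are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def encode_question(tokens, embeddings):
    """Text feature extraction (stand-in): mean of the word vectors."""
    vecs = [embeddings[t] for t in tokens]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def attend(question_vec, region_feats):
    """Question-guided attention: scaled dot-product weights over image regions."""
    scale = math.sqrt(len(question_vec))
    weights = softmax([dot(question_vec, r) / scale for r in region_feats])
    dim = len(region_feats[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_feats))
                for i in range(dim)]
    return weights, attended


def fuse(question_vec, image_vec):
    """Modal feature fusion (assumed here): element-wise product."""
    return [q * v for q, v in zip(question_vec, image_vec)]


def predict(fused, answer_weights, answers):
    """Answer prediction: linear scores over a fixed answer set, then softmax."""
    probs = softmax([dot(fused, w) for w in answer_weights])
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Toy, randomly initialised inputs; a real system would use trained parameters.
random.seed(0)
DIM = 4
vocab = ["what", "color", "is", "the", "cat"]
embeddings = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
region_feats = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
answers = ["black", "white", "orange"]
answer_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in answers]

q = encode_question(["what", "color", "is", "the", "cat"], embeddings)
weights, v = attend(q, region_feats)
answer, probs = predict(fuse(q, v), answer_weights, answers)
print(answer, [round(p, 3) for p in probs])
```

Because the prediction step scores a fixed answer list, the sketch also shows why the answer set matters: an answer absent from `answers` can never be predicted, which is the limitation the abstract's second contribution targets.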