Research And Application Of Visual Question And Answering Algorithm Based On Deep Learning

Posted on:2022-08-01

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Feng

GTID:2558306914462624

Subject:Electronic and communication engineering

Abstract/Summary:

Visual question answering (VQA) is a frontier task in current artificial intelligence research that combines computer vision and natural language processing. Because this kind of cross-field task is closer to the way people obtain information, it has broad application scenarios and high research value. In VQA, a model must understand the semantics of a given question and then find relevant information in a given image to predict the answer. The classical VQA model consists of four modules: image feature extraction, text feature extraction, modal feature fusion, and answer prediction. Based on an analysis of current VQA models, this paper optimizes the classical VQA model from three perspectives: using multiple attention mechanisms, focusing on the answer set, and improving feature extraction. The main research work and contributions of this paper are as follows.

1. To integrate cross-modal information effectively, this paper introduces multiple attention mechanisms. In addition to the multi-head attention in the Transformer text encoder, it proposes target attention and spatial attention algorithms for images. The former identifies which areas of an image need to be focused on, and the latter captures positional associations between objects or regions within the image.

2. The study finds that existing VQA models all modify the first three of the four modules listed above and focus only on questions and images, ignoring the role of the answer set in answer prediction. This paper proposes a two-step VQA network and a VQA network with embedded image text recognition to address two problems: the answer set contains too many invalid and redundant answers, and the correct answer may not exist in the answer set.

3. The object detection algorithm Faster R-CNN is combined with the convolutional network ResNet-101 to process image information, so that foreground object features and background features complement each other. At the same time, a multi-layer Transformer encoder addresses the problem that the long short-term memory (LSTM) network encoder ignores word order, and it integrates the low-level word features used to locate regions with the high-level abstract features used to understand semantics.

Experimental results for the proposed optimization methods are reported on several datasets, including VQA v2.0, and their performance is compared with that of other advanced VQA models. The comparison demonstrates the advanced performance of the proposed methods.
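As a rough illustration of the classical pipeline named in the abstract (feature extraction, fusion, answer prediction), the following minimal sketch applies question-guided attention over image region features, fuses the result with the question vector, and scores a small answer set. It is not the thesis's architecture. The hand-written toy vectors stand in for Faster R-CNN/ResNet-101 region features and Transformer question encodings, and the dot-product attention, element-wise product fusion, and all function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(question_vec, region_feats):
    """Weight each image region by its similarity to the question (assumed dot-product attention)."""
    weights = softmax([dot(question_vec, region) for region in region_feats])
    dim = len(region_feats[0])
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return attended, weights


def predict_answer(question_vec, region_feats, answer_embeddings):
    """Fuse the question with the attended image vector and pick the best-scoring answer."""
    attended, weights = attend(question_vec, region_feats)
    # Element-wise product fusion: an assumption, one common simple choice.
    fused = [q * v for q, v in zip(question_vec, attended)]
    scores = {ans: dot(fused, emb) for ans, emb in answer_embeddings.items()}
    return max(scores, key=scores.get), weights


# Toy inputs (hypothetical): two image regions, a 2-d question vector, two candidate answers.
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0]]
answers = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}

answer, weights = predict_answer(question, regions, answers)
print(answer, [round(w, 3) for w in weights])
```

With these toy inputs, the first region aligns with the question and receives most of the attention weight, so the answer whose embedding matches the fused vector is selected. A real system would replace the hand-written vectors with learned features and a trained classifier over the answer set.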

Keywords/Search Tags:

visual question answering, multiple attention mechanism, two-step network, text recognition

Related items

1	Research On Visual Question Answering Based On Text Semantic Understanding
2	Research On Visual Question Answering Method With Visual Content Understanding And Text Information Analysis
3	Research On Visual Question-Answering Methods Based On Attention Mechanism
4	Research On Visual Question Answering Based On Multiple Attention Mechanism And Feature Fusion Algorithm
5	Research On Visual Question Answering Based On Visual Attention
6	Research On Visual Question Answering Method Based On Attention Mechanism And Multimodal Fusion
7	Exploring Multi-Step Reasoning And Visual Localization In Video Question Answering
8	Research On Visual Question Answering Method Based On Attention Mechanism
9	Question-Guided Attention Reasoning Mechanism For Visual Question Answering
10	Research On Visual Question Answering Based On Deep Neural Network