
Visual Question Answering Based On Optical Character Recognition

Posted on: 2024-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: M J Han
Full Text: PDF
GTID: 2568307115977189
Subject: Computer Science and Technology
Abstract/Summary:
With the development of deep learning techniques and the increasingly large amount of available data, models that use information from a single modality can no longer meet practical needs. In multimodal research, Visual Question Answering (VQA) is a suitable task for exploring the interactive fusion of multimodal information: given an image and a question expressed in natural language, the model must produce an accurate answer. Recently, it has been found that Optical Character Recognition (OCR) texts, i.e., textual information appearing in images, are important for models to understand multimodal scenes in the natural world. A Visual Question Answering task based on OCR texts has therefore been proposed. After surveying domestic and international research on traditional VQA and on OCR-based VQA, this paper identifies three current shortcomings, concerning multimodal interaction capability, the training-prediction process, and the volume of training data:

(1) Insufficient multimodal interaction capability. The input data contain an OCR text modality, a visual object modality, and a question modality. Existing models usually fuse this information simply by concatenation, but the relationships among these modalities are complex: there is a spatial location relationship between the OCR text modality and the visual object modality, and a guiding relationship from the question modality to both the OCR text modality and the visual object modality.

(2) Inconsistency between training and prediction. In the baseline model for the TextVQA task, the ground-truth answer labels are used as input during training, whereas a zero vector is used as input during prediction. Moreover, the model performs inference once per sample in training but twelve times in prediction.

(3) Insufficient training data. In the existing dataset, each image usually corresponds to only 1-2 question-answer pairs, yet each image contains 12 OCR words on average, which indicates that a large number of OCR texts are wasted rather than used to construct question-answer pairs.

To address these problems, this paper proposes a Text-based Visual Question Answering method based on multi-source interaction and a diversity-oriented data augmentation method based on prompt words. The research contributions and innovations of this paper are as follows:

(1) A Text-based Visual Question Answering method based on multi-source interaction and noise enhancement is proposed, comprising a visual fusion module, a question guidance module, a self-aggregation module, and a noise enhancement module. The first three modules exploit cross-modal relationship features to construct different attention-based fusion mechanisms: the visual fusion module proposes spatial-appearance attention using appearance and location features; the question guidance module proposes question-guided attention based on shared keywords; and the self-aggregation module uses the self-attention mechanism to strengthen the connections among OCR texts. The noise enhancement module aligns the input data distributions of training and prediction, mainly through a scheduled masking strategy: as the number of training steps grows, the ground-truth inputs are randomly replaced with zero vectors, using the return value of a masking function as the replacement probability; this probability increases as training progresses.

(2) A diversity-oriented data augmentation method based on prompt words and OCR grouping is proposed. The method reverses the input and output of the Text-based Question Answering model to construct the augmentation pipeline: the inputs of the augmentation model are images, answers, and OCR texts, and the outputs are pseudo-labeled questions. A prompt word method and an OCR grouping algorithm are also proposed. In the prompt word method, question types are first matched by a search algorithm, and a large-scale language model then learns the relationship between question types and syntax, so that different types of questions can be generated from different prompt words. The OCR grouping algorithm raises the proportion of multi-word answers among the pseudo-labeled question-answer pairs, making them better match natural distributions.

(3) Extensive experiments and ablation studies on the TextVQA dataset demonstrate the effectiveness of the proposed model and methods.
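The scheduled masking strategy of the noise enhancement module can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract only states that the replacement probability grows with training, so the linear schedule, the `p_max` cap, and the function names here are all assumptions.

```python
import random

def mask_probability(step, total_steps, p_max=1.0):
    """Return the replacement probability for the current training step.
    A linear ramp from 0 to p_max is assumed here; the thesis only
    specifies that the probability increases as training progresses."""
    return min(p_max, step / total_steps * p_max)

def scheduled_mask(answer_embeddings, step, total_steps):
    """Randomly replace ground-truth answer embeddings with zero vectors.
    Early in training most inputs are the real labels; late in training
    most are zero vectors, matching the zero-vector input used at
    prediction time and so aligning the two input distributions."""
    p = mask_probability(step, total_steps)
    return [
        [0.0] * len(vec) if random.random() < p else vec
        for vec in answer_embeddings
    ]
```

At step 0 the inputs pass through unchanged, and by the final step every ground-truth vector is replaced, so the model is gradually weaned off teacher forcing rather than switched abruptly at prediction time.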
Keywords/Search Tags: Visual Question Answering, Optical Character Recognition, Multi-modal Interaction, Teacher Forcing, Data Augmentation