
Research On Priors Mitigation And Multimodal Reasoning For Visual Question Answering System

Posted on: 2024-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Jin
Full Text: PDF
GTID: 2568307076493074
Subject: Electronic information
Abstract/Summary:
With the development of artificial intelligence, traditional text-based question answering systems can no longer meet the demand for accurate and efficient question answering, and users increasingly expect such systems to acquire and understand different types of information. Visual question answering (VQA) systems combine textual and visual information for joint reasoning. This compensates for the single information source of text-only question answering and enables more natural, intuitive, and accurate answers. However, the reasoning modules of existing VQA systems cannot cope with complex problems such as multi-object reasoning, and the prevalence of priors makes such systems prone to biased answer predictions. To improve the accuracy of VQA, this thesis investigates both priors mitigation and multimodal reasoning techniques.

For priors mitigation, by analyzing the sources and effects of language and visual priors, the thesis classifies the priors in VQA into two types, positive and negative. Different network modules are designed to capture and process the two types, retaining the priors that provide essential information and removing those that bias the answers. To further alleviate the priors problem, the thesis designs a dynamically changing feedback objective function that uses the intermediate results of the prior-capturing modules to balance the weights of the loss values according to the strength of the priors. For multimodal reasoning, the thesis designs a new multimodal reasoning module to handle complex and variable question types; the module merges spatial coordinates with visual semantic representations to strengthen the interaction of multimodal feature vectors and the inference of correlations between visual regions.

Comparison experiments on standard datasets demonstrate that the proposed models effectively mitigate the priors problem and have strong multimodal reasoning capability. Moreover, the priors mitigation module does not depend on any particular VQA baseline and can be used as a plug-in with any baseline VQA architecture. Combining these models and techniques, the thesis builds a VQA-based early childhood education demonstration system to verify the validity and practicality of the models in real applications. Specifically, the research work includes the following aspects.

(1) To address the priors problem in VQA systems, a priors mitigation model, PMM-VQA (Priors Mitigation Model for VQA), is proposed. The model mitigates language and visual prior knowledge problems by capturing and processing prior knowledge in separate modules. First, language priors are classified as positive or negative, and different network modules capture and process each type to alleviate the language priors (a hedged sketch of one such capture branch follows this paragraph). Second, the same theory is applied to visual information: by analyzing the sources of visual priors in VQA, the negative portion is identified and removed by a dedicated negative visual prior removal module. In the prediction phase, all priors relevant to the question are retained, maximizing the available reasoning information. PMM-VQA achieved the best performance on the VQA-CP v2 dataset: based on the S-MRL baseline, its accuracy was 53.81% with the language priors mitigation module and reached the optimum of 55.15% with both the language and visual priors mitigation modules.
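The abstract does not specify how the prior-capturing modules are built. As a rough illustration only, the following PyTorch sketch shows one common way to capture a language prior: a question-only branch whose logits approximate the answer distribution implied by the question text alone, which a downstream module can then retain or suppress. All class names, dimensions, and the masking combination are hypothetical, not the thesis's actual design.

import torch
import torch.nn as nn

class QuestionOnlyPriorBranch(nn.Module):
    """Hypothetical sketch: estimates the language prior P(answer | question)
    from the pooled question embedding alone, ignoring the image. The
    captured prior can then be kept (positive) or suppressed (negative) by
    downstream modules; the positive/negative split is not modelled here."""

    def __init__(self, q_dim=1024, hidden=512, num_answers=3129):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, q_emb):
        # q_emb: (batch, q_dim) pooled question representation
        return self.mlp(q_emb)  # prior logits over answers from text alone

def debias_logits(fused_logits, prior_logits):
    # One assumed way to use the captured prior during training: mask the
    # fused multimodal logits with the sigmoid of the prior logits, so the
    # main branch is discouraged from re-learning what the prior explains.
    return fused_logits * torch.sigmoid(prior_logits)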
(2) To enhance multimodal reasoning, a visual semantic multimodal reasoning model, VSR-VQA (Visual Semantic Multimodal Reasoning Model for VQA), is proposed. First, an attention mechanism focuses on the keywords of the question text, reducing the interference of irrelevant information and simplifying the data to be processed in multimodal reasoning. Second, a visual semantic multimodal reasoning module is designed, containing a bilinear super-diagonal fusion module and a visual semantic inference module; it enhances the multimodal reasoning capability of the VQA model by strengthening the fine-grained representation and interaction of multimodal feature vectors as well as the correlation inference among visual regions (hedged sketches of a generic bilinear fusion step and a coordinate-aware region reasoning step follow this list). VSR-VQA achieved the optimal performance of 64.49% on the VQA v2 dataset. Finally, combining PMM-VQA and VSR-VQA into the PMM-VSR model both enhances multimodal reasoning and alleviates the priors problem: PMM-VSR reached accuracies of 62.75% on VQA v2 and 54.97% on VQA-CP v2, showing that the model balances prior mitigation against the retention of reasoning information.

(3) To further alleviate the priors problem, a dynamically changing feedback objective function is designed from the intermediate results of modules in the PMM-VSR model (the combination of PMM-VQA and VSR-VQA). The loss weight of each answer is set dynamically according to the strength of its language prior, balancing its proportion in the total VQA loss (a hedged sketch of such a weighting also follows this list). Models using this feedback objective obtain stable accuracy improvements on both VQA v2 and VQA-CP v2 across multiple VQA baseline models, demonstrating the effectiveness and generalizability of the objective.

(4) Using the proposed PMM-VSR model as the core technology, a VQA-based early childhood education system is designed and implemented. Its functional modules are designed to help children develop expression, object recognition, counting, and reasoning skills, and the implementation of each module visually verifies the effectiveness and practicality of the PMM-VSR model.
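The abstract names a bilinear super-diagonal fusion module in item (2) but gives no equations. The sketch below implements a generic low-rank bilinear pooling between a question vector and a visual vector, the family to which such super-diagonal decompositions belong; the dimensions, rank, and normalization choices are assumptions, not the thesis's exact decomposition.

import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Sketch of a low-rank bilinear interaction between a question vector
    and a visual vector. Full bilinear pooling (q^T W v per output unit) is
    approximated by projecting both inputs to rank*out_dim, taking a
    Hadamard product, and sum-pooling over the rank dimension."""

    def __init__(self, q_dim=1024, v_dim=2048, out_dim=512, rank=5):
        super().__init__()
        self.rank, self.out_dim = rank, out_dim
        self.q_proj = nn.Linear(q_dim, rank * out_dim)
        self.v_proj = nn.Linear(v_dim, rank * out_dim)
        self.dropout = nn.Dropout(0.1)

    def forward(self, q, v):
        # q: (batch, q_dim), v: (batch, v_dim)
        joint = self.q_proj(q) * self.v_proj(v)  # (batch, rank*out_dim)
        joint = self.dropout(joint)
        joint = joint.view(-1, self.rank, self.out_dim).sum(dim=1)  # pool over rank
        # signed square-root + L2 normalization, a common stabilizer
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return nn.functional.normalize(joint, dim=-1)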
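Likewise, the visual semantic inference module is said to merge spatial coordinates with visual semantic representations and to infer correlations between visual regions. A minimal coordinate-aware self-attention sketch under those assumptions; the dimensions and the use of multi-head attention are illustrative choices, not the thesis design:

import torch
import torch.nn as nn

class CoordAwareRegionReasoning(nn.Module):
    """Sketch: concatenate each region's bounding-box coordinates with its
    visual feature, then let regions exchange information through
    self-attention, so correlations between regions can be inferred."""

    def __init__(self, v_dim=2048, coord_dim=4, hidden=512, heads=8):
        super().__init__()
        self.embed = nn.Linear(v_dim + coord_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, regions, boxes):
        # regions: (batch, n_regions, v_dim) detector features
        # boxes:   (batch, n_regions, 4) normalized (x1, y1, x2, y2)
        x = self.embed(torch.cat([regions, boxes], dim=-1))
        attended, _ = self.attn(x, x, x)  # region-to-region correlation
        return self.norm(x + attended)    # residual connection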
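For the dynamically changing feedback objective in item (3), the abstract states only that each answer's loss weight is set from the strength of its language prior. One plausible reading, assumed here, is that stronger priors receive smaller weights, so prior-explainable answers contribute less to the total loss; the exact weighting function below is hypothetical.

import torch
import torch.nn.functional as F

def prior_weighted_bce(fused_logits, prior_logits, targets, eps=1e-6):
    """Hypothetical dynamic feedback objective: down-weight each answer's
    binary cross-entropy by the confidence of the question-only prior
    branch. Because the prior branch is updated during training, the
    weights change dynamically from batch to batch."""
    prior_conf = torch.sigmoid(prior_logits).detach()  # (batch, num_answers)
    weights = 1.0 - prior_conf                         # strong prior -> small weight
    bce = F.binary_cross_entropy_with_logits(
        fused_logits, targets, reduction="none")
    return (weights * bce).sum() / (weights.sum() + eps)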
Keywords: Visual Question Answering, Priors Mitigation, Multimodal Inference, Deep Feature Fusion