
Research On Methods Of Visual Question Answering Based On Adaptive Multimodal Feature Fusion

Posted on: 2024-02-20    Degree: Master    Type: Thesis
Country: China    Candidate: D S Yuan    Full Text: PDF
GTID: 2568307079455534    Subject: Information and Communication Engineering
Abstract/Summary:
Visual question answering (VQA) is a cross-domain task that combines computer vision and natural language processing to understand multimodal information from images and text. However, because of data imbalance, VQA methods suffer from language bias in multimodal feature fusion and training: textual information and features come to dominate the model's predictions. A language-biased VQA model relies excessively on the correlation between questions and answers and ignores the information in the image, which leads to poor robustness and low interpretability in practice. On one hand, language bias originates from data imbalance and feature overfitting, causing the model to overfit the head samples of the dataset. On the other hand, this overfitting also exposes the poor robustness of current VQA methods when they face the complex, variable content and potential attacks of practical applications. This thesis therefore studies the theory and methods of VQA based on adaptive multimodal feature fusion:

(1) From the perspective of data balance, data augmentation and contrastive learning are used to balance the data. Data augmentation expands the amount of data fed into the model by generating counterfactual examples from the training set, improving balance. Contrastive learning treats factual examples, counterfactual examples, and original samples as triplets, helping the model learn more accurate and robust feature representations. Together, these methods balance the data distribution and improve model performance and robustness.

(2) From the perspective of feature balance, the concept of generalization uncertainty is proposed and combined with a multi-student network for multi-view feature balance. Generalization uncertainty refers to the uncertainty of the model's predictions on new samples; it helps the model resist overfitting and improves its generalization ability. The multi-student network learns more robust feature representations by training multiple networks, preventing the model from overfitting to particular views. Experimental results show that these techniques effectively alleviate language bias and improve model performance and robustness.

(3) From the perspective of robustness and generalization verification, a collaborative adversarial training method is proposed after studying how a biased VQA model behaves under multimodal attacks. Based on the assumption that the model should produce the same result for an adversarial sample and its original sample, robustness is improved by constraining the feature consistency of adversarial and original samples. In addition, a defensive distillation method is proposed to reduce the model's sensitivity to attacks: knowledge distillation is used to smooth the original model's gradients, reducing their steepness and improving its generalization ability. Experimental results show that these methods effectively improve the robustness of the model and its performance under adversarial attacks.
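The triplet idea in point (1) can be sketched as a margin-based contrastive objective. The abstract does not give the exact loss, so the function names, the cosine-similarity choice, and the margin value below are illustrative assumptions:

```python
import math

def cosine(u, v):
    # cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triplet_contrastive_loss(original, factual, counterfactual, margin=0.5):
    """Pull the factual example toward the original sample and push the
    counterfactual example away, up to a margin (illustrative form)."""
    pos = cosine(original, factual)         # should be high
    neg = cosine(original, counterfactual)  # should be low
    return max(0.0, margin - (pos - neg))
```

When the factual example already matches the original and the counterfactual is orthogonal, the loss is zero; when the model's features confuse the two, the loss grows and pushes the representations apart.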
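For point (2), one natural reading of generalization uncertainty with a multi-student network is to measure disagreement among the students' predictions; the variance-based score below is a minimal sketch under that assumption, not the thesis's exact definition:

```python
def generalization_uncertainty(student_predictions):
    """Average per-class variance of the class-probability vectors produced
    by several student networks for the same input (illustrative measure)."""
    n = len(student_predictions)          # number of students
    k = len(student_predictions[0])       # number of classes
    means = [sum(p[c] for p in student_predictions) / n for c in range(k)]
    variances = [sum((p[c] - means[c]) ** 2 for p in student_predictions) / n
                 for c in range(k)]
    return sum(variances) / k
```

A score of zero means the students agree perfectly; a high score flags samples on which the ensemble is uncertain, which can then be down-weighted or regularized to resist overfitting.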
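The feature-consistency constraint in point (3) can be sketched as an extra penalty added to the task loss; the mean-squared form and the weight `lam` are assumptions, since the abstract only states that adversarial and original features should match:

```python
def consistency_loss(feat_orig, feat_adv):
    # mean squared difference between the features of an original sample
    # and its adversarial counterpart
    return sum((a - b) ** 2 for a, b in zip(feat_orig, feat_adv)) / len(feat_orig)

def collaborative_adversarial_loss(task_loss, feat_orig, feat_adv, lam=1.0):
    """Task loss plus a weighted feature-consistency penalty (illustrative)."""
    return task_loss + lam * consistency_loss(feat_orig, feat_adv)
```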
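Finally, defensive distillation in point (3) typically smooths the output distribution (and hence the gradients an attacker can exploit) by training the student on temperature-softened targets. The temperature value is illustrative:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax with temperature T; larger T flattens the distribution,
    which is the smoothing effect defensive distillation relies on."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

At T = 1 the distribution is peaked; at a high temperature such as T = 20 it is much flatter, so the loss surface around the decision boundary is less steep and gradient-based attacks get weaker signals.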
Keywords/Search Tags: Visual Question Answering, Language Bias, Knowledge Distillation, Adversarial Attack and Defense