
Research On Methods Of Visual Question Answering Based On Adaptive Multimodal Feature Fusion

Posted on: 2024-02-20    Degree: Master    Type: Thesis
Country: China    Candidate: D S Yuan    Full Text: PDF
GTID: 2568307079455534    Subject: Information and Communication Engineering
Abstract/Summary:
Visual question answering (VQA) is a cross-domain task that combines computer vision and natural language processing to understand multimodal information from images and text. However, because of data imbalance, VQA methods suffer from language bias in multimodal feature fusion and training: textual information and features come to dominate the model's predictions. A language-biased VQA model relies excessively on the correlation between questions and answers and ignores the information in the image, which leads to poor robustness and low interpretability in practice. On one hand, language bias originates from data imbalance and feature overfitting, causing the model to overfit the head samples of the dataset. On the other hand, this overfitting also exposes the poor robustness of current VQA methods when they face the complex, variable content and potential attacks of practical applications. This thesis therefore studies the theory and methods of VQA based on adaptive multimodal feature fusion:

(1) From the perspective of data balance, data augmentation and contrastive learning are used to balance the data. Data augmentation expands the amount of data fed into the model by generating counterfactual examples from the training set, improving balance. Contrastive learning treats factual examples, counterfactual examples, and original samples as triplets, helping the model learn more accurate and robust feature representations. Together, these methods balance the data distribution and improve model performance and robustness.

(2) From the perspective of feature balance, the concept of generalization uncertainty is proposed and combined with a multi-student network for multi-view feature balance. Generalization uncertainty refers to the uncertainty of the model's predictions on new samples; it helps the model resist overfitting and improves its generalization ability. The multi-student network learns more robust feature representations by training multiple networks, preventing the model from overfitting to particular views. Experimental results show that these techniques effectively alleviate language bias and improve model performance and robustness.

(3) From the perspective of robustness and generalization verification, a collaborative adversarial training method is proposed after studying how a biased VQA model behaves under multimodal attacks. Based on the assumption that the model should produce the same result for an adversarial sample and its original sample, robustness is improved by constraining the feature consistency of adversarial and original samples. In addition, a defensive distillation method is proposed to reduce the model's sensitivity to attacks: knowledge distillation is used to smooth the original model's gradients, reducing their steepness and improving its generalization ability. Experimental results show that these methods effectively improve the robustness of the model and its performance under adversarial attacks.
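The triplet idea in point (1) can be sketched as a margin-based contrastive objective. The abstract does not give the exact loss, so the function names, the cosine-similarity choice, and the margin value below are illustrative assumptions:

```python
import math

def cosine(u, v):
    # cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triplet_contrastive_loss(original, factual, counterfactual, margin=0.5):
    """Pull the factual example toward the original sample and push the
    counterfactual example away, up to a margin (illustrative form)."""
    pos = cosine(original, factual)         # should be high
    neg = cosine(original, counterfactual)  # should be low
    return max(0.0, margin - (pos - neg))
```

When the factual example already matches the original and the counterfactual is orthogonal, the loss is zero; when the model's features confuse the two, the loss grows and pushes the representations apart.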
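For point (2), one natural reading of generalization uncertainty with a multi-student network is to measure disagreement among the students' predictions; the variance-based score below is a minimal sketch under that assumption, not the thesis's exact definition:

```python
def generalization_uncertainty(student_predictions):
    """Average per-class variance of the class-probability vectors produced
    by several student networks for the same input (illustrative measure)."""
    n = len(student_predictions)          # number of students
    k = len(student_predictions[0])       # number of classes
    means = [sum(p[c] for p in student_predictions) / n for c in range(k)]
    variances = [sum((p[c] - means[c]) ** 2 for p in student_predictions) / n
                 for c in range(k)]
    return sum(variances) / k
```

A score of zero means the students agree perfectly; a high score flags samples on which the ensemble is uncertain, which can then be down-weighted or regularized to resist overfitting.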
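The feature-consistency constraint in point (3) can be sketched as an extra penalty added to the task loss; the mean-squared form and the weight `lam` are assumptions, since the abstract only states that adversarial and original features should match:

```python
def consistency_loss(feat_orig, feat_adv):
    # mean squared difference between the features of an original sample
    # and its adversarial counterpart
    return sum((a - b) ** 2 for a, b in zip(feat_orig, feat_adv)) / len(feat_orig)

def collaborative_adversarial_loss(task_loss, feat_orig, feat_adv, lam=1.0):
    """Task loss plus a weighted feature-consistency penalty (illustrative)."""
    return task_loss + lam * consistency_loss(feat_orig, feat_adv)
```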
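Finally, defensive distillation in point (3) typically smooths the output distribution (and hence the gradients an attacker can exploit) by training the student on temperature-softened targets. The temperature value is illustrative:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax with temperature T; larger T flattens the distribution,
    which is the smoothing effect defensive distillation relies on."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

At T = 1 the distribution is peaked; at a high temperature such as T = 20 it is much flatter, so the loss surface around the decision boundary is less steep and gradient-based attacks get weaker signals.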
Keywords/Search Tags: Visual Question Answering, Language Bias, Knowledge Distillation, Adversarial Attack and Defense