Visual question answering (VQA) requires models to generate answers from both the image and the question, making visual information and the question the crucial factors in producing correct answers. However, most current VQA models suffer from language bias: they learn only surface-level correlations between questions and answers in the training data, without fully exploiting visual information for reasoning. Language bias leads models to answer from high-frequency words or phrases while ignoring less common but relevant ones, so they fail to predict correct answers for questions involving novel scenes or objects, which undermines the practical application of VQA models. To mitigate language bias, this paper takes a visual perspective, analyzes its causes, and improves the model from a distance metric learning perspective. The main contributions of this paper are as follows:

(1) This paper proposes an unsupervised distance metric learning method for visual question answering, which mitigates language bias by masking out question-irrelevant visual regions. Masking the image regions unrelated to the question reduces the difficulty of image understanding for the VQA model, and generating the answer from only the question-relevant visual features strengthens the model's attention to the key regions of the image. In addition, the method uses a triplet loss to separate positive and negative visual features in a high-dimensional semantic space without supervision, so that the positive, question-relevant visual features can be extracted accurately.

(2) This paper proposes a self-supervised counterfactual metric learning method to alleviate the language bias problem in VQA models. Most current VQA models employ traditional supervised
learning approaches, which rely solely on the final loss function for supervision and therefore lack strong supervision signals. Supervising only the final prediction ignores the causal relationship between the question-image pair and the predicted result, which leaves language bias in place. The proposed method improves on the unsupervised distance metric learning method in this paper by introducing an adaptive feature selection module, which adaptively separates visual features into question-relevant and question-irrelevant parts and predicts the answer directly from the question-relevant features, ensuring that they are the actual cause of the predicted answer. In addition, the method constructs counterfactual samples from the question-irrelevant visual features to provide counterfactual supervision signals, further reducing language bias. By guiding training through causal inference, the method effectively alleviates the language bias problem.

(3) This paper conducts extensive experiments on several public datasets to demonstrate the effectiveness of the proposed approach. The results show that the proposed methods achieve state-of-the-art performance on these datasets without additional manual annotation, providing a new idea and method for improving the performance and robustness of VQA models.
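The triplet-based separation in contribution (1) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the question embedding is taken as the anchor, a question-relevant visual feature as the positive, and a question-irrelevant one as the negative, with Euclidean distance and a margin of 0.2 assumed for concreteness.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: pull the anchor toward the positive
    feature and push it away from the negative one."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy features: question embedding (anchor), a question-relevant
# visual feature (positive), a question-irrelevant one (negative).
q = np.array([1.0, 0.0])
v_rel = np.array([0.9, 0.1])
v_irr = np.array([0.0, 1.0])

# Already well-separated features incur zero loss.
loss = triplet_loss(q, v_rel, v_irr)
```

Minimizing this loss over all triplets drives relevant and irrelevant visual features apart in the embedding space without requiring any region-level labels, which matches the unsupervised setting described above.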
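The adaptive split of visual features in contribution (2) can also be illustrated with a small sketch. The paper does not detail the selection rule here, so an attention-weighted top-k split is assumed purely for illustration; `split_by_attention` is a hypothetical helper, not the paper's module.

```python
import numpy as np

def split_by_attention(features, attn, top_k):
    """Split region features into question-relevant (top-k attention)
    and question-irrelevant (the remaining regions)."""
    order = np.argsort(attn)[::-1]          # indices by descending attention
    rel_idx, irr_idx = order[:top_k], order[top_k:]
    return features[rel_idx], features[irr_idx]

# Toy example: 4 region features with attention weights.
feats = np.arange(8.0).reshape(4, 2)
attn = np.array([0.1, 0.5, 0.3, 0.1])
relevant, irrelevant = split_by_attention(feats, attn, top_k=2)

# The answer is predicted from `relevant` only; a counterfactual
# sample pairs the question with `irrelevant`, which should not
# support the original answer.
counterfactual_visual = irrelevant
```

Training the model to stay confident on the relevant split while penalizing confident predictions on the counterfactual input gives the extra supervision signal that plain answer-level loss lacks.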