Natural language processing and computer vision continue to achieve breakthroughs with the help of deep learning, and cross-modal interaction between images and text has gradually become a research hotspot. As one branch of multimodal research, visual question answering (VQA) has been studied in increasing detail. In a VQA task, the model extracts image features and question features from an input image and an image-related question, respectively; the two features are then fused into a joint feature vector; finally, the predicted answer is produced as a multi-class classification over the answer set. Compared with a single computer vision or natural language processing task, the key to VQA is to fully understand the semantics of the question in the context of the image.

Extracting image features and weighting the key words in questions are therefore critical, and researchers have proposed many VQA models. However, most models have an unsatisfactory understanding of question semantics and image features: they attend to the superficial correlation between questions and answers in the dataset while ignoring the image information. We call this phenomenon language prior. To alleviate the language prior, this thesis proposes a strategy for augmenting the training samples, with the aim of strengthening the model's attention to image information and reducing its dependence on the question alone. The main work is as follows:

(1) This thesis proposes a VQA method that trains the model with similar negative samples. For each original image-question pair in the dataset, we use the Gram matrix, cosine similarity, and KL divergence to find the image in the dataset that is closest to the original image. This most similar image is combined with the original question to form the most similar image-question pair (a similar negative sample). All samples are sent for model training; a self-supervised auxiliary task judges the relevance of each image-question pair and mines the contrast between the answers predicted for the original samples and the unlabeled samples. The goal is to maximize the predicted answer score on original samples and minimize it on negative samples, so that, when faced with different samples, the model attends to the important image regions that can answer the question correctly, thereby improving its performance.

(2) Building on the model framework in (1), this thesis introduces an improved VQA model based on counterfactual image negative samples. To evaluate more accurately how effectively negative samples alleviate language priors, this thesis replaces the negative-sample generation method and verifies model performance. By masking regions of the original image, counterfactual image samples are divided into positive and negative counterfactual image samples, where a negative sample image retains only the regions that are not critical for answer prediction. Training on such samples adjusts, under self-supervised loss feedback, the model's ability to interpret the image; the model is encouraged to rely on image content when predicting answers and to focus on the key regions that lead to correct predictions.

This thesis verifies through comparative experiments that negative samples are effective in improving model performance. On the VQA v2, VQA-CP v2, and VQA-CP v1 datasets, the proposed negative-sample VQA strategy raises the performance of the model. Across various experiments, comparing the metrics with current strong models shows that the strategies of similar negative samples and data augmentation with counterfactual samples achieve better performance, which fully verifies the effectiveness of the negative-sample strategy in this thesis.
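The similar-negative-sample mining described in (1) can be sketched minimally as follows. This is an illustrative sketch, not the thesis's actual implementation: it assumes image features are already available as NumPy arrays (flat vectors for cosine similarity and KL divergence, channel-unrolled feature maps for the Gram matrix), and all function names are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    # features: (C, H*W) channel-unrolled feature map.
    # The Gram matrix captures channel co-activation statistics,
    # one of the three similarity signals used here.
    return features @ features.T

def cosine_similarity(a, b):
    # Cosine of the angle between two flat feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) between two (normalized, nonnegative) feature distributions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def most_similar_image(query_feat, candidate_feats):
    # Return the index of the candidate image whose features are closest
    # to the query image under cosine similarity; pairing that image with
    # the original question yields a "similar negative sample".
    sims = [cosine_similarity(query_feat, c) for c in candidate_feats]
    return int(np.argmax(sims))
```

In a full pipeline, the scores from all three measures would be combined to rank candidates, and the retrieved image would be paired with the original question as an unlabeled negative sample for the self-supervised auxiliary task.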