| Visual Question Answering (VQA) is a multimedia understanding task that requires computers to answer natural language questions about the content of a given image. Since its conception, VQA has attracted considerable attention from researchers. Emotional Visual Question Answering extends VQA: it answers questions about visual emotions as well as visual content, and incorporates emotional information into its answers. The main contributions of this paper are as follows: (1) Early VQA models performed poorly on emotion-related questions, mainly because they neglected the emotional information in images and made insufficient use of key image regions and key words in the text. This led to a shallow understanding of fine-grained questions and reduced answer accuracy. To fully incorporate image emotion information into VQA models and use it to improve the model’s ability to answer questions, we propose an Image Emotion Enhanced Multimodal Visual Question Answering model (IEVQA). IEVQA consists of two main modules: a semantic module and an emotion module. The semantic module processes the semantic information in VQA tasks, while the emotion module analyzes the emotional attributes of images. The two modules share the same Transformer encoder, which fuses semantic and emotional information when processing questions. Experiments on the related VQA benchmark dataset show that IEVQA outperforms the comparison methods on comprehensive indicators, and they validate the use of emotional information to assist VQA models. (2) Current emotional VQA tasks mainly focus on multiple-choice VQA. However, existing emotional VQA models tend to produce less natural answers after incorporating emotional information, and introducing emotional information before the answer is obtained reduces the accuracy of the choices. To incorporate image emotion information naturally into multiple-choice VQA answers without compromising model accuracy, we propose a Prompt-based Image Emotional Visual Question Answering Method (PIEVQA). PIEVQA designs an explicit emotional prompt text for each image’s emotional information and feeds the correct answer chosen by the VQA model, together with the emotional prompt text, into the pretrained GPT-3 model to obtain an answer that carries emotional information. Experiments on the VQA-V2 and Visual7W datasets verify the effectiveness of PIEVQA. The results show that, compared with other VQA models, PIEVQA generates more natural and human-like emotional answers that better describe the emotional information of images while maintaining high accuracy. This provides new insights for emotional expression in the VQA field and paves the way for new application scenarios of prompt-based VQA methods. |
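
The second stage of PIEVQA, in which the chosen answer is combined with an emotional prompt text before being passed to a language model, could look roughly like the sketch below. It is illustrative only: the emotion labels, the template wording, and the names `EMOTION_HINTS` and `build_emotional_prompt` are assumptions, not the prompts or code used in the paper, and the call to GPT-3 itself is deliberately left out.

```python
# Hypothetical sketch: labels, hint wording, and function names are
# assumptions made for illustration, not the paper's actual prompts.

EMOTION_HINTS = {
    "joy": "The image conveys a joyful mood.",
    "sadness": "The image conveys a sad mood.",
    "fear": "The image conveys a fearful mood.",
}

# Fallback hint used when the emotion label is not in the table above.
NEUTRAL_HINT = "The image has no strong emotional tone."


def build_emotional_prompt(question: str, answer: str, emotion: str) -> str:
    """Combine the VQA model's chosen answer with an explicit emotion hint.

    The answer is selected first and passed through unchanged, so the
    emotional information cannot affect the multiple-choice decision.
    """
    hint = EMOTION_HINTS.get(emotion, NEUTRAL_HINT)
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Emotion: {hint}\n"
        "Rewrite the answer as one natural sentence that keeps the answer "
        "unchanged and reflects the emotion."
    )


prompt = build_emotional_prompt("What is the child holding?", "a balloon", "joy")
print(prompt)
```

The resulting string would then be sent to the pretrained language model. Building the prompt only after the answer has been chosen mirrors the abstract's point that emotional information should not reduce the accuracy of the choice.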