
Research On Visual Dialog Technology Based On Visual Coreference Resolution

Posted on: 2024-03-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H W Zhang
GTID: 1528306944466774
Subject: Computer Science and Technology
Abstract/Summary:
Conversation is the most natural way for people to exchange information. How a conversation proceeds is often shaped by both its language content and the visual setting in which the participants are situated. To this end, researchers proposed the Visual Dialog (VD) task, which studies an agent that can conduct multiple rounds of continuous question answering with a human using both language and visual information. Unlike the Visual Question Answering (VQA) task, where only a single question needs to be answered, VD requires answering a succession of questions that may be related to each other. As an important task of information interaction between the two modalities of vision and language, VD has drawn increasing attention and gradually become a research hotspot.

A central problem in the VD task is Visual Coreference Resolution (VCR), i.e., modeling the semantic relationship between the current question, the dialog history, and the visual objects in the image. VCR can be decomposed into two subproblems: obtaining the referential information of the current question from the dialog history, named Coreference Resolution (CR), and obtaining the visual objects related to the current question from the image, named Visual Grounding (VG). Existing work still has several shortcomings in solving these two subproblems. First, only single-grained semantic representations (word-grained or sentence-grained) of the current question and dialog history are used, which is not sufficient to solve the VCR problem accurately. Second, the impact of dialog history biases on the model's prediction of the correct answer to the current question is mostly ignored. Third, most work models the semantic relationship to the dialog history directly even for current questions that are unrelated to the dialog history, which is clearly inappropriate.

Based on an extensive review of research related to the VD task, this thesis conducts a series of studies to address these shortcomings. We first propose a visual dialog model based on a multi-granularity semantic collaborative reasoning network, which uses multi-granularity semantic representations of the current question and dialog history to collaboratively solve the VCR problem. However, this approach directly models the semantic relationship between the current question, the dialog history, and the visual objects without accounting for the impact of biased information in the dialog history on model performance. We therefore propose a visual dialog model based on a reciprocal question representation learning network. When solving the subproblem CR, this model uses two types of question representations to interactively learn a new, semantically explicit question representation, while mitigating the over-fusion of biased information from the history into that representation. Subsequently, this thesis proposes a visual dialog model based on a two-path collaborative reasoning network, which addresses the problem that the above two models inappropriately relate the current question directly to the dialog history when the current question does not require solving the subproblem CR. The main work and contributions of this thesis are summarized as follows.

A visual dialog model based on a multi-granularity semantic collaborative reasoning network is proposed, which can capture the historical information and visual objects related to the current question more accurately. The model first uses a question-aware attention refer module, guided by the current question, to search the dialog history for word-grained and sentence-grained semantic information related to the current question, and uses that historical information to update the word-grained and sentence-grained semantic representations of the current question. Second, a visual-aware attention alignment module separately models the semantic relations between each of the updated representations (word-grained and sentence-grained) and the visual object representations, and the two resulting semantic relationships are used to collaboratively reason about the target visual object associated with the question. Finally, the updated visual object representations, fused with the final question representation, are delivered to the decoder to predict the answer to the current question. Experiments on publicly available datasets show that the proposed model captures the historical information and visual objects related to the current question more accurately and significantly improves question-answering accuracy compared with existing reasoning models based on single-granularity semantic representations.

A visual dialog model based on a reciprocal question representation learning network is proposed, which balances the use of dialog history information to effectively mitigate the impact of biased dialog history information. When a model is fed the dialog history directly, it tends to learn the biased information in the history. For instance, the model may directly match words or phrases from the dialog history as the answer to the current question. To address this problem, an adaptive token representation selection module in the proposed model uses a gate function to adaptively fuse two types of question representations, one with and one without historical information, so that the generated new question representation incorporates less historical bias information. In addition, the proposed model includes an interactive learning token representation module that uses a specially designed entropy loss function to enable interactive learning of the same tokens in both types of question representations, further mitigating the historical bias information fused into the question representation. Experiments on publicly available datasets demonstrate that the proposed model uses dialog history in a more balanced manner: it concentrates on learning dialog history information relevant to the current question and further mitigates the effect of bias in the dialog history on predicting the correct answer.

A visual dialog model based on a two-path collaborative reasoning network is proposed, which addresses the problem that existing models inappropriately model the semantic relationship between the dialog history and current questions that introduce new topics. During a conversation, one may raise a question that is unrelated to the dialog history, i.e., introduce a new topic, and it may be inappropriate to establish a relationship between this question and the dialog history. Existing approaches directly model the relationship between the question, the dialog history, and the visual objects in images, which may reduce the accuracy with which the model infers answers to questions that are not related to the dialog history. To address this problem, two paths are designed in the proposed model: an image-only path and an image-history path. The representations of the same visual objects differ between the two paths, and so do the attention distributions generated over the visual objects. Based on these differences, this thesis designs a difference-based loss function for collaborative localization of visual objects, which uses the similarity between the two paths' representations of visual objects to guide the differences between the two paths' attention distributions over those objects, so that the two paths complement each other. In addition, a multi-source historical information similarity loss function is designed in the image-history path to ensure accurate access to question-related historical information. Experiments on publicly available datasets show that the dual-path model in this thesis improves the accuracy of predicting the answer to the current question compared with a single-path model that directly uses the dialog history.

A visual dialog demonstration system for research is designed and implemented. In this system, an image is input under different models and the researcher can ask multiple questions about the content of the image in succession. For each question asked, the demonstration system outputs the answer, thus helping the researcher analyze the performance of different models.
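
The gate-based fusion of two question representations mentioned in the abstract (one with and one without historical information) could be sketched roughly as below. This is only an illustrative sketch and not the thesis's implementation: the function name `gated_fusion`, the element-wise (per-dimension) gate, and the parameters `w_plain`, `w_hist`, and `bias` are all assumptions made for the example, and the abstract does not specify the actual gate design.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    q_plain: List[float],   # question representation WITHOUT history (assumed input)
    q_hist: List[float],    # question representation WITH history (assumed input)
    w_plain: List[float],   # hypothetical per-dimension gate weights
    w_hist: List[float],    # hypothetical per-dimension gate weights
    bias: List[float],      # hypothetical per-dimension gate bias
) -> List[float]:
    """Fuse the two representations dimension by dimension.

    For each dimension, a gate g in (0, 1) decides how much of the
    history-aware value is kept: fused = g * q_hist + (1 - g) * q_plain.
    A gate near 0 suppresses historical (possibly biased) information.
    """
    fused = []
    for qp, qh, wp, wh, b in zip(q_plain, q_hist, w_plain, w_hist, bias):
        g = sigmoid(wp * qp + wh * qh + b)
        fused.append(g * qh + (1.0 - g) * qp)
    return fused
```

In this sketch, a strongly negative bias pushes the gate toward 0, so the fused representation falls back to the history-free question representation; in a trained model the gate parameters would be learned rather than set by hand.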
Keywords/Search Tags:visual dialog, multimodal fusion, attention mechanisms, multi-granularity semantic representation, interactive learning