
Research And Application Of Image And Language Cross-modal Deep Learning In The Field Of Instrumentation

Posted on: 2023-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Gao
Full Text: PDF
GTID: 2532306911482854
Subject: Measuring and Testing Technology and Instruments
Abstract/Summary:
With the development of deep learning, many fields of artificial intelligence have advanced substantially, including natural language processing and multi-modal processing. In recent years, great progress has been made on the multi-round dialogue rewriting task, the multi-modal image-text question answering task, and the cross-modal dialogue task. Research on the cross-modal visual dialogue question answering task, however, remains relatively scarce, even though it contributes to the development of artificial intelligence. This task can be divided into two sub-tasks: multi-round dialogue rewriting and multi-modal image-text question answering; the latter involves the two directions of modal fusion and modal alignment. This thesis argues that collaborative learning can serve as a means of assisting multi-modal tasks, and that introducing it helps complete the cross-modal visual dialogue question answering task.

First, in multi-round dialogue, the many references and omissions easily produce ambiguity. If the dialogue is fed into a multi-modal model without rewriting, the text lacks the words needed to align image features with the text embedding, so a multi-round dialogue rewriting module is used to restore complete sentences. Existing rewriting models draw the missing words only from the dialogue history; this thesis introduces a collaborative learning mechanism that additionally supplies contextual visual collaborative information. In multi-modal image-text question answering, the answer is produced from image information and therefore itself constitutes contextual visual collaborative information. Feeding this answer into the next round of dialogue rewriting creates a synergy between visual and textual information: visual information can be restored in the text to be rewritten, which improves rewriting accuracy.

On this basis, the thesis proposes a cross-modal collaborative visual dialogue question answering model that incorporates contextual visual collaborative information. Injecting this information into the rewriting task strengthens the role of visual information in the sentence and yields rewritten sentences that contain it. The rewritten sentence is converted into a text embedding, image features are extracted as visual information, and a dual-stream multi-modal processing method is adopted: the two streams are first encoded independently, then cross-learned in a cross-modal stage that fuses and aligns the visual and textual information, finally yielding cross-modal information and the answer to the image-text question. This answer is passed as contextual visual collaborative information into the next round of dialogue rewriting, forming a collaborative learning loop that improves both the accuracy of rewriting and the quality of multi-modal fusion and alignment of the dialogue. Based on this idea, the cross-modal visual dialogue question answering task is studied theoretically, the key formulas are derived, and an overall model diagram is given. The model comprises a multi-round dialogue rewriting module, a multi-modal fusion and alignment module, and a collaborative learning module, completing the construction of the cross-modal collaborative visual dialogue question answering model.
Keywords/Search Tags:Natural language processing, Deep learning, Multi-round dialogue rewriting, Multimodality, Visual question answering, Collaborative learning, Instrument field