With the rapid development of computer vision and natural language processing, cross-modal visual dialogue has become a popular and challenging task in recent years. It combines image recognition, relational reasoning, and natural language understanding, and therefore requires a range of techniques to extract the information needed for the complex interactions between modal features. To achieve this, a model must exploit both visual cues (image information) and textual knowledge (dialogue history information); how to fully and effectively fuse the multi-modal information involved in the task has thus become a mainstream research direction.

Most visual dialogue models perform correlation reasoning between image features and word-level text features, but ignore the connection between images and text passages. That is, they rarely consider the higher-level semantic relationship between image and text, which can limit their performance on some tasks. Moreover, the visual dialogue task involves three inputs at once (image, question, and dialogue history), whereas an attention mechanism generally takes only two, so a feature fusion module tailored to the task is required. To address the limitations of mainstream models in the granularity of semantic information fusion, this paper proposes a visual dialogue algorithm based on a multi-input Transformer with multi-level information fusion. The model contains a word-level multi-step inference module, a question-guided dialogue history passage search module, and a multi-level information fusion decoding module, which together account for both local fine-grained semantic details and global contextual topic information. In particular, the multi-input Transformer module supports parallel encoding and multi-step inference over multiple inputs at a consistent semantic granularity.

Building on this multi-input Transformer model, the paper then seeks to improve the metrics that measure ranking quality. A large-scale pre-trained model with dynamic word embeddings is adopted: the model is pre-trained on large datasets and then fine-tuned on the downstream visual dialogue task, and its performance further improves on that of the multi-input Transformer model. In addition, a new text-passage relationship is constructed from another perspective, reorganizing the three inputs of the visual dialogue task: the question and the dialogue history are fused into a single new textual feature, which is then fused with the visual features. The resulting multi-level semantic granularity visual dialogue algorithm based on BERT shows clear advantages over traditional models.

To evaluate the effectiveness of the proposed models, they are compared with mainstream state-of-the-art algorithms on two publicly available datasets, VisDial v0.9 and VisDial v1.0. The results show that the proposed models achieve new, superior performance.
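The abstract only names the fusion modules, so as a rough illustration of the multi-input idea, the following minimal PyTorch sketch (hypothetical names and dimensions, not the paper's actual architecture) fuses three inputs by giving the question one cross-attention branch per extra input, reflecting the observation above that a standard attention layer accepts only two inputs at a time:

```python
# Minimal sketch of a three-input fusion layer. All module names, dimensions,
# and the fusion strategy are assumptions for illustration; the abstract does
# not specify the actual multi-input Transformer design.
import torch
import torch.nn as nn

class MultiInputFusionLayer(nn.Module):
    """Fuses question features with image and dialogue-history features
    via two parallel cross-attention branches, one per extra input."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_hist = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, image, history):
        # question: (B, Lq, D); image: (B, Lv, D); history: (B, Lh, D)
        q_img, _ = self.attn_img(question, image, image)        # attend to image regions
        q_hist, _ = self.attn_hist(question, history, history)  # attend to history tokens
        fused = self.ffn(torch.cat([q_img, q_hist], dim=-1))    # merge the two views
        return self.norm(question + fused)                      # residual + norm
```

Stacking several such layers would approximate the multi-step inference behaviour the abstract describes, refining the question representation against both information sources at each step.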
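For the BERT-based variant, the abstract states that the question and the dialogue history are merged into a single textual feature before fusion with the visual features. A minimal sketch under that assumption, using the Hugging Face transformers API (the fusion head, feature dimensions, and example strings are placeholders, not the paper's design):

```python
# Sketch: pack dialogue history and the current question into one BERT input
# sequence, then fuse the pooled text feature with pre-extracted image features.
# The fusion head and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

history = "is there a dog in the picture ? yes , a brown one ."
question = "what is the dog doing ?"

# History and question become one textual input: [CLS] history [SEP] question [SEP]
enc = tokenizer(history, question, return_tensors="pt")
text_feat = bert(**enc).pooler_output               # (1, 768) contextual text feature

image_feat = torch.randn(1, 768)                    # placeholder for image features
fusion = nn.Sequential(nn.Linear(2 * 768, 768), nn.Tanh())
joint = fusion(torch.cat([text_feat, image_feat], dim=-1))  # joint feature for answer ranking
```

Because BERT re-encodes every token in context, the resulting text features are dynamic rather than static word embeddings, matching the dynamic-embedding property the abstract attributes to the pre-trained model.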