With the rapid development of computer vision and natural language processing, cross-modal visual dialogue has become a popular and challenging task in recent years. It combines image recognition, relational reasoning, and natural language understanding, and therefore requires a range of techniques to extract the information needed for the complex interactions between modal features. To achieve this, a model must exploit both visual cues (image information) and textual knowledge (dialogue history information); how to fully and effectively fuse the multi-modal information involved in the task has thus become a mainstream research direction.

Most visual dialogue models perform correlation reasoning between image features and word-level text features, but ignore the connection between images and text passages. That is, they rarely consider the higher-level semantic relationship between image and text, which can limit their performance on some tasks. Moreover, the visual dialogue task involves three inputs at once (image, question, and dialogue history), whereas an attention mechanism generally takes only two, so a feature fusion module tailored to the task is required. To address the limitations of mainstream models in the granularity of semantic information fusion, this paper proposes a visual dialogue algorithm based on a multi-input Transformer with multi-level information fusion. The model contains a word-level multi-step inference module, a question-guided dialogue history passage search module, and a multi-level information fusion decoding module, which together account for both local fine-grained semantic details and global contextual topic information. In particular, the multi-input Transformer module supports parallel encoding and multi-step inference over multiple inputs at a consistent semantic granularity.

Building on this multi-input Transformer model, the paper then seeks to improve the metrics that measure ranking quality. A large-scale pre-trained model with dynamic word embeddings is adopted: the model is pre-trained on large datasets and then fine-tuned on the downstream visual dialogue task, and its performance further improves on that of the multi-input Transformer model. In addition, a new text-passage relationship is constructed from another perspective, reorganizing the three inputs of the visual dialogue task: the question and the dialogue history are fused into a single new textual feature, which is then fused with the visual features. The resulting multi-level semantic granularity visual dialogue algorithm based on BERT shows clear advantages over traditional models.

To evaluate the effectiveness of the proposed models, they are compared with mainstream state-of-the-art algorithms on two publicly available datasets, VisDial v0.9 and VisDial v1.0. The results show that the proposed models achieve new, superior performance.
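The abstract only names the fusion modules, so as a rough illustration of the multi-input idea, the following minimal PyTorch sketch (hypothetical names and dimensions, not the paper's actual architecture) fuses three inputs by giving the question one cross-attention branch per extra input, reflecting the observation above that a standard attention layer accepts only two inputs at a time:

```python
# Minimal sketch of a three-input fusion layer. All module names, dimensions,
# and the fusion strategy are assumptions for illustration; the abstract does
# not specify the actual multi-input Transformer design.
import torch
import torch.nn as nn

class MultiInputFusionLayer(nn.Module):
    """Fuses question features with image and dialogue-history features
    via two parallel cross-attention branches, one per extra input."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_hist = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, image, history):
        # question: (B, Lq, D); image: (B, Lv, D); history: (B, Lh, D)
        q_img, _ = self.attn_img(question, image, image)        # attend to image regions
        q_hist, _ = self.attn_hist(question, history, history)  # attend to history tokens
        fused = self.ffn(torch.cat([q_img, q_hist], dim=-1))    # merge the two views
        return self.norm(question + fused)                      # residual + norm
```

Stacking several such layers would approximate the multi-step inference behaviour the abstract describes, refining the question representation against both information sources at each step.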
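For the BERT-based variant, the abstract states that the question and the dialogue history are merged into a single textual feature before fusion with the visual features. A minimal sketch under that assumption, using the Hugging Face transformers API (the fusion head, feature dimensions, and example strings are placeholders, not the paper's design):

```python
# Sketch: pack dialogue history and the current question into one BERT input
# sequence, then fuse the pooled text feature with pre-extracted image features.
# The fusion head and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

history = "is there a dog in the picture ? yes , a brown one ."
question = "what is the dog doing ?"

# History and question become one textual input: [CLS] history [SEP] question [SEP]
enc = tokenizer(history, question, return_tensors="pt")
text_feat = bert(**enc).pooler_output               # (1, 768) contextual text feature

image_feat = torch.randn(1, 768)                    # placeholder for image features
fusion = nn.Sequential(nn.Linear(2 * 768, 768), nn.Tanh())
joint = fusion(torch.cat([text_feat, image_feat], dim=-1))  # joint feature for answer ranking
```

Because BERT re-encodes every token in context, the resulting text features are dynamic rather than static word embeddings, matching the dynamic-embedding property the abstract attributes to the pre-trained model.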