
Research On Visual Grounding Algorithm Based On Multimodal Feature Pairs

Posted on: 2024-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: J J Han
Full Text: PDF
GTID: 2568307073962529
Subject: Electronic information
Abstract/Summary:
With the popularization of short-video applications and the rapid development of Internet and big-data technology, the demand for processing cross-modal media information is growing. Given an image and a natural-language sentence describing an object in it, the visual grounding task predicts a single bounding box for the region the sentence refers to. Visual grounding is a multimodal task that combines vision and language, and it underpins the implementation of many other multimodal tasks, so its importance has become increasingly prominent. However, current visual grounding algorithms suffer from insufficient representation of modal features and weak perception of object position information, so the task still faces challenges. To address the difficulty of fully exploiting multimodal information, this thesis proposes two visual grounding networks that leverage the complementary strengths of different modalities, extract multimodal features, and accurately locate the object described by the text in the image.

To address the insufficient representation caused by the complexity of text and image information and the under-use of interactive information between them, a cross-modal attention graph convolutional visual grounding network is designed by combining an attention mechanism with graph convolution. First, the attention mechanism is used to build a cross-modal attention module that captures the complementarity and correlation between the text description and the image content, enhancing key features, refining the multimodal features, and improving the network's ability to represent them. Second, a visual-language dual-channel graph convolution module applies graph convolution to visual features and text features separately, so that the network attends to the contextual information of both text and image and obtains features more favorable for grounding decisions. The network's grounding performance is evaluated through experiments on public datasets, and results on multiple visual grounding benchmarks show that the proposed network performs well.

To address the insufficient target perception of one-stage networks, and considering the time cost of candidate-box generation in two-stage designs, this thesis also designs a one-stage visual grounding network based on a multi-level patch-aware graph network and a forgetting-gate mechanism. First, to better capture the compositional information inside the image and the contextual information among perceived objects, a graph convolutional network built on image patches is proposed; to capture composition and context under different receptive fields, the graph network is set at three different levels. Then, a designed cross-modal attention mechanism fuses image features and text features. Finally, the proposed cyclic forgetting-gate mechanism fuses graph features and local convolution features at multiple levels, yielding multimodal features rich in contextual and semantic information. Experiments on public datasets verify the accuracy and effectiveness of the designed network, and comparative and ablation experiments verify the contribution of each module in the model.

Finally, to test the robustness and stability of the algorithms, this thesis constructs data that simulate complex text and image inputs. Interference experiments show that the proposed visual grounding networks can distinguish nuances in the language description, accurately locate the target object within a certain range of brightness changes in the image, and resist image noise. Across multiple public datasets and complex input conditions, the two multimodal visual grounding models in this thesis achieve excellent results.
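The cross-modal attention idea above — letting each visual region attend over text tokens to pick up complementary language context — can be illustrated with a minimal NumPy sketch. This is not the thesis's actual implementation: the scaled dot-product form, the residual connection, and all names and shapes (`cross_modal_attention`, `visual`, `text`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text):
    """Enhance visual features with text-conditioned context.

    visual: (Nv, d) array of region/patch features.
    text:   (Nt, d) array of token features.
    Returns a (Nv, d) array: each region plus its attended text context.
    """
    d = visual.shape[1]
    scores = visual @ text.T / np.sqrt(d)   # (Nv, Nt) cross-modal similarities
    attn = softmax(scores, axis=-1)         # attention weights over text tokens
    context = attn @ text                   # (Nv, d) text context per region
    return visual + context                 # residual enhancement of key features
```

A symmetric pass (text attending over visual regions) would give the complementary direction of interaction that the dual-channel design suggests.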
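The forgetting-gate fusion can likewise be sketched: a learned sigmoid gate, in the spirit of an LSTM forget gate, decides per dimension how much of the graph feature to keep versus the convolutional feature, and applying the gate cyclically over the three levels accumulates multi-level context. This is a hedged illustration only; the weight names (`Wg`, `Wc`, `b`) and the exact fusion order are assumptions, not the thesis's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_gate_fuse(graph_feat, conv_feat, Wg, Wc, b):
    """Gated fusion of one level of graph features with conv features.

    graph_feat, conv_feat: (N, d) arrays; Wg, Wc: (d, d); b: (d,).
    The gate lies in (0, 1), so the output is a per-dimension convex
    combination of the two inputs.
    """
    gate = sigmoid(graph_feat @ Wg + conv_feat @ Wc + b)
    return gate * graph_feat + (1.0 - gate) * conv_feat

def multi_level_fuse(graph_feats, conv_feat, Wg, Wc, b):
    """Cyclically fold graph features from several levels into the conv feature."""
    fused = conv_feat
    for g in graph_feats:  # e.g. three levels of patch-graph features
        fused = forget_gate_fuse(g, fused, Wg, Wc, b)
    return fused
```

In a real network the gate weights would differ per level and be learned end to end; sharing them here just keeps the sketch short.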
Keywords/Search Tags: Visual Grounding, Graph Convolutional Network, Attention Mechanism, Hierarchical Network, Patch-aware Network