
Multi-View Feature Inference Network For Cross Modal Matching

Posted on: 2022-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Wu
Full Text: PDF
GTID: 2558307109969549
Subject: Software engineering
Abstract/Summary:
Cross-modal matching has attracted increasing attention since it involves two important modalities, vision and language. Given an input in one modality, the goal of cross-modal matching is to find the closest matching result in the other modality, and vice versa. This thesis focuses on image-text matching.

Recently, image-text matching based on local region-word semantic alignment has attracted considerable research attention. However, most cross-modal matching studies treat the similarities of aligned region-word pairs equally, without considering their respective importance. Moreover, local alignment methods are prone to global semantic drift because they ignore the overall theme of the image-text pair. In addition, there has been little research on the semantic relationships between objects expressed in local sentences. How to learn a comprehensive and unified representation for data of different modalities is also a key challenge. To address these problems, the main contributions are as follows.

Dual-View Semantic Inference Network for Image-Text Matching. A novel Dual-View Semantic Inference (DVSI) network is proposed to leverage both local and global semantic matching in a holistic deep framework. For the local view, a region enhancement module mines the priorities of different regions in the image, which provides the ability to discover latent region-word relationships. For the global view, the overall semantics of the image are summarized for global semantic matching to avoid global semantic drift. Extensive experiments on common datasets demonstrate the effectiveness of the proposed DVSI.

Region Reinforcement Network with Topic Constraint for Image-Text Matching. The dual-view semantic inference network has two shortcomings. First, its global semantic matching infers the global image features directly from a Bi-GRU, so each image region is treated equally, without considering how the relationships between regions influence the whole image. Second, in local semantic matching, the region enhancement module only computes an attention-based weighted average and does not aggregate spatial contextual information in the image. The novelty of the proposed region reinforcement network is two-fold. On the one hand, a topic constraint module is presented to summarize the central theme of the image, which constrains deviation from the original image semantics. On the other hand, a region reinforcement module is proposed that uses avg-max pooling to aggregate spatial region information and collect the salient content of different region features. Extensive experiments on common datasets verify that the proposed method improves on the original local matching methods by 2-4%.

Graph Attention Network for Cross-Media Matching. A cross-media matching network is constructed based on the graph attention network. The network builds graph structures over the image regions and the text words and performs graph matching, which infers fine-grained structural correspondences. In addition, global semantics are inferred from the generated graph structures for global matching, complementing graph matching and thereby achieving more comprehensive cross-media semantic matching. Extensive experimental results show that the proposed method learns graph matching and global matching jointly, obtaining competitive results on the MSCOCO and Flickr30K datasets.
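Two of the operations named in this abstract, attention-weighted region-word alignment (the local view) and avg-max pooling over region features, can be illustrated with a minimal NumPy sketch. The function names, the cosine-similarity scoring, the softmax temperature, and the fusion of the two pooled vectors by summation are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def region_word_similarity(regions, words, temperature=4.0):
    """Attention-weighted local alignment score for an image-sentence pair.

    regions: (n_regions, d) image region features.
    words:   (n_words, d) word features.
    Each word attends over all regions (softmax on cosine similarity),
    and the pair score is the mean cosine between each word and its
    attended region context -- so well-aligned region-word pairs
    dominate instead of all pairs being weighted equally.
    """
    r = l2_normalize(regions)
    w = l2_normalize(words)
    sim = w @ r.T                              # (n_words, n_regions) cosine similarities
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax attention over regions, per word
    context = l2_normalize(attn @ regions)     # (n_words, d) attended region context
    return float(np.mean(np.sum(w * context, axis=1)))

def avg_max_pool(region_features):
    """Aggregate region features into one global vector via avg-max pooling.

    Average pooling keeps the overall context of all regions; max pooling
    keeps the most salient response per channel. Summing the two (an
    assumed fusion) yields a single (d,) vector reflecting both.
    """
    return region_features.mean(axis=0) + region_features.max(axis=0)
```

Under this sketch, a matched pair (words drawn from the same regions) should score higher than a mismatched one, and the pooled vector can serve as a global image representation for global semantic matching.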
Keywords/Search Tags: Cross-modal matching, Global semantic matching, Local semantic matching, Multi-view inference, Attention mechanism