
Multi-View Feature Inference Network For Cross Modal Matching

Posted on: 2022-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Wu
Full Text: PDF
GTID: 2558307109969549
Subject: Software engineering
Abstract/Summary:
Cross-modal matching has attracted increasing attention since it involves two important modalities, vision and language. Given an input in one modality, the goal of cross-modal matching is to find the closest matching result in the other modality, and vice versa. This thesis focuses on image-text matching.

Recently, image-text matching based on local region-word semantic alignment has attracted considerable research attention. However, most cross-modal matching studies treat the similarities of aligned region-word pairs equally, without considering their respective importance. Moreover, local alignment methods are prone to global semantic drift because they ignore the overall theme of the image-text pair. In addition, there has been little research on the semantic relationships between objects expressed in local sentences. How to learn a comprehensive and unified representation for data of different modalities is also a key challenge. To address these problems, the main contributions are as follows.

Dual-View Semantic Inference Network for Image-Text Matching. A novel Dual-View Semantic Inference (DVSI) network is proposed to leverage both local and global semantic matching in a holistic deep framework. For the local view, a region enhancement module mines the priorities of different regions in the image, which provides the ability to discover latent region-word relationships. For the global view, the overall semantics of the image are summarized for global semantic matching to avoid global semantic drift. Extensive experiments on common datasets demonstrate the effectiveness of the proposed DVSI.

Region Reinforcement Network with Topic Constraint for Image-Text Matching. The dual-view semantic inference network has two shortcomings. First, its global semantic matching infers the global image features directly from a Bi-GRU, so each image region is treated equally, without considering how the relationships between regions influence the whole image. Second, in local semantic matching, the region enhancement module only computes an attention-based weighted average and does not aggregate spatial contextual information in the image. The novelty of the proposed region reinforcement network is two-fold. On the one hand, a topic constraint module is presented to summarize the central theme of the image, which constrains deviation from the original image semantics. On the other hand, a region reinforcement module is proposed that uses avg-max pooling to aggregate spatial region information and collect the salient content of different region features. Extensive experiments on common datasets verify that the proposed method improves on the original local matching methods by 2-4%.

Graph Attention Network for Cross-Media Matching. A cross-media matching network is constructed based on the graph attention network. The network builds graph structures over the image regions and the text words and performs graph matching, which infers fine-grained structural correspondences. In addition, global semantics are inferred from the generated graph structures for global matching, complementing graph matching and thereby achieving more comprehensive cross-media semantic matching. Extensive experimental results show that the proposed method learns graph matching and global matching jointly, obtaining competitive results on the MSCOCO and Flickr30K datasets.
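Two of the operations named in this abstract, attention-weighted region-word alignment (the local view) and avg-max pooling over region features, can be illustrated with a minimal NumPy sketch. The function names, the cosine-similarity scoring, the softmax temperature, and the fusion of the two pooled vectors by summation are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def region_word_similarity(regions, words, temperature=4.0):
    """Attention-weighted local alignment score for an image-sentence pair.

    regions: (n_regions, d) image region features.
    words:   (n_words, d) word features.
    Each word attends over all regions (softmax on cosine similarity),
    and the pair score is the mean cosine between each word and its
    attended region context -- so well-aligned region-word pairs
    dominate instead of all pairs being weighted equally.
    """
    r = l2_normalize(regions)
    w = l2_normalize(words)
    sim = w @ r.T                              # (n_words, n_regions) cosine similarities
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax attention over regions, per word
    context = l2_normalize(attn @ regions)     # (n_words, d) attended region context
    return float(np.mean(np.sum(w * context, axis=1)))

def avg_max_pool(region_features):
    """Aggregate region features into one global vector via avg-max pooling.

    Average pooling keeps the overall context of all regions; max pooling
    keeps the most salient response per channel. Summing the two (an
    assumed fusion) yields a single (d,) vector reflecting both.
    """
    return region_features.mean(axis=0) + region_features.max(axis=0)
```

Under this sketch, a matched pair (words drawn from the same regions) should score higher than a mismatched one, and the pooled vector can serve as a global image representation for global semantic matching.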
Keywords/Search Tags: Cross-modal matching, Global semantic matching, Local semantic matching, Multi-view inference, Attention mechanism