
Research On Image-Text Cross-Modal Matching Based On Attention Mechanism

Posted on: 2022-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: H Yuan
Full Text: PDF
GTID: 2518306737956889
Subject: Control Engineering
Abstract/Summary:
With the rapid development of mobile Internet technology and the wide adoption of intelligent communication devices, people can collect and disseminate the data they are interested in anytime and anywhere. As a result, the global data scale has grown explosively, and data types have become increasingly diverse: images, text, video, audio and other multimodal data are generated rapidly and at low cost, and spread quickly across the network. One of the main reasons big data enables efficient information transmission on Internet platforms is that it contains many different modalities; the correspondence, complementarity and mutual conversion among these multimodal data accelerate information transfer. To help users quickly and accurately retrieve valuable information from the ever-growing body of multimodal data, it is important to study cross-modal data matching and retrieval. Image and text are two representative modalities. Overcoming the significant heterogeneity between them and extracting the core semantic features needed for cross-modal matching is one of the hot issues in the field of image-text matching, and attention mechanisms can play an important role in solving this problem. On the one hand, traditional image feature learning easily captures redundant information that is irrelevant to association analysis, so a contextual attention mechanism is needed to learn discriminative visual features. On the other hand, coarse-grained matching lacks local detail semantics, so a local cross-modal attention mechanism is needed to model fine-grained correspondence on top of coarse-grained alignment. The specific research contents are as follows:

(1) To address the difficulty of extracting the core semantic features required for image-text association analysis, this paper proposes an image-text matching method based on Recurrent Canonical Correlation Analysis (RCCA), which includes a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) for dynamic image representation learning. The network uses a contextual attention mechanism to selectively focus on salient content in the image, and then integrates the content attended to in the first few steps into a global image feature representation, so as to better mine the core semantic information used for association analysis while filtering out irrelevant redundant content in the image. In addition, a conventional LSTM-RNN encodes the text, serializing the semantic information of all words into a global feature representation of the text sequence. Finally, Canonical Correlation Analysis (CCA) correlates the feature representations of the image and text modalities through maximum linear correlation learning to achieve more accurate cross-modal matching. Extensive experimental analysis shows that the proposed RCCA method outperforms previous CCA-based methods on the image-text matching task.
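As a concrete illustration of the final CCA step described above, the following is a minimal sketch of linear CCA, assuming the image and text features have already been produced by the attention-based LSTM-RNN encoders; the function name, regularization term, and use of NumPy are illustrative assumptions, not the thesis's actual implementation.

    import numpy as np

    def linear_cca(X, Y, k, reg=1e-4):
        # Minimal linear CCA sketch (hypothetical helper): find projections
        # that maximally correlate image features X (n x dx) and text
        # features Y (n x dy) learned by the two encoders.
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        n = X.shape[0]
        Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
        Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
        Cxy = X.T @ Y / n
        # Whiten each view via Cholesky factors, then take the top-k singular
        # directions of the whitened cross-covariance: these give the
        # maximally correlated projection pairs.
        Wx_white = np.linalg.inv(np.linalg.cholesky(Cxx)).T
        Wy_white = np.linalg.inv(np.linalg.cholesky(Cyy)).T
        U, s, Vt = np.linalg.svd(Wx_white.T @ Cxy @ Wy_white)
        Wx = Wx_white @ U[:, :k]       # image-side projection
        Wy = Wy_white @ Vt.T[:, :k]    # text-side projection
        return Wx, Wy, s[:k]           # s[:k] are the canonical correlations

At retrieval time, image and text features would be projected with Wx and Wy and compared in the shared space, for example by cosine similarity; this last detail is a common choice in CCA-based matching rather than something stated in the abstract.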
(2) To address the difficulty of fine-grained cross-modal image-text matching, this paper studies fine-grained matching based on visual semantic reasoning. Existing work on image-text cross-modal matching mainly learns global feature representations of the two modalities, image and text, and embeds them into a common multimodal semantic space for cross-modal similarity learning. However, such coarse-grained matching may lose local details and semantics, so fine-grained image-text matching remains a challenging problem. To solve it, this paper proposes an improved Visual Semantic Reasoning model (VSR++), which builds on image-text coarse-grained matching and additionally uses a local cross-modal attention mechanism to model fine-grained region-word correspondence. To better exploit the complementary advantages of matching at different granularities, a simple and effective joint training strategy is introduced to balance their relative importance. Extensive experimental analysis shows that the proposed VSR++ method achieves leading performance on two benchmark image-text matching datasets.
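To make the local cross-modal attention and joint training ideas concrete, below is a minimal sketch in the spirit of stacked cross-modal attention, assuming precomputed region and word features; the temperature, the pooling choices, and the balancing weight alpha are illustrative assumptions rather than the exact VSR++ formulation.

    import torch
    import torch.nn.functional as F

    def region_word_similarity(regions, words, temperature=9.0):
        # Local cross-modal attention sketch (illustrative, not the exact
        # VSR++ formulation).
        # regions: (n_regions, d) image region features
        # words:   (n_words, d)  word features for one sentence
        r = F.normalize(regions, dim=-1)
        w = F.normalize(words, dim=-1)
        # For each word, attend over the image regions.
        attn = F.softmax(temperature * (w @ r.t()), dim=-1)  # (n_words, n_regions)
        attended = attn @ regions                             # (n_words, d)
        # Fine-grained score: average cosine similarity between each word
        # and its attended visual context.
        return F.cosine_similarity(words, attended, dim=-1).mean()

    def joint_similarity(sim_global, sim_local, alpha=0.5):
        # Joint coarse+fine score; the balancing weight alpha is an assumption.
        return alpha * sim_global + (1.0 - alpha) * sim_local

In training, sim_global would come from the coarse-grained global-embedding similarity, and the joint score would feed a ranking-style objective; this pairing is a common design in this line of work and is only assumed here, not specified by the abstract.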
Keywords/Search Tags:Image-text cross-modal matching, Recurrent canonical correlation analysis, Contextual attention mechanism, Visual semantic reasoning, Local cross-modal attention mechanism