
Research On Image-Text Cross-Modal Matching Based On Attention Mechanism

Posted on: 2022-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: H Yuan
Full Text: PDF
GTID: 2518306737956889
Subject: Control Engineering
Abstract/Summary:
With the rapid development of mobile Internet technology and the wide adoption of intelligent communication devices, people can collect and disseminate the data they are interested in anytime and anywhere. As a result, the global data scale has grown explosively, and data types have become increasingly diverse: images, text, video, audio and other multimodal data are generated rapidly and at low cost, and spread quickly across the network. One of the main reasons big data enables efficient information transmission on Internet platforms is that it contains many different modalities; the correspondence, complementarity and mutual conversion among these multimodal data accelerate information transfer. To help users quickly and accurately retrieve valuable information from the ever-growing body of multimodal data, it is important to study cross-modal data matching and retrieval. Image and text are two representative modalities. Overcoming the significant heterogeneity between them and extracting the core semantic features needed for cross-modal matching is one of the hot issues in the field of image-text matching, and attention mechanisms can play an important role in solving this problem. On the one hand, traditional image feature learning easily captures redundant information that is irrelevant to association analysis, so a contextual attention mechanism is needed to learn discriminative visual features. On the other hand, coarse-grained matching lacks local detail semantics, so a local cross-modal attention mechanism is needed to model fine-grained correspondence on top of coarse-grained alignment. The specific research contents are as follows:

(1) To address the difficulty of extracting the core semantic features required for image-text association analysis, this paper proposes an image-text matching method based on Recurrent Canonical Correlation Analysis (RCCA), which includes a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) for dynamic image representation learning. The network uses a contextual attention mechanism to selectively focus on salient content in the image, and then integrates the content attended to in the first few steps into a global image feature representation, so as to better mine the core semantic information used for association analysis while filtering out irrelevant redundant content in the image. In addition, a conventional LSTM-RNN encodes the text, serializing the semantic information of all words into a global feature representation of the text sequence. Finally, Canonical Correlation Analysis (CCA) correlates the feature representations of the image and text modalities through maximum linear correlation learning to achieve more accurate cross-modal matching. Extensive experimental analysis shows that the proposed RCCA method outperforms previous CCA-based methods on the image-text matching task.
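As a concrete illustration of the final CCA step described above, the following is a minimal sketch of linear CCA, assuming the image and text features have already been produced by the attention-based LSTM-RNN encoders; the function name, regularization term, and use of NumPy are illustrative assumptions, not the thesis's actual implementation.

    import numpy as np

    def linear_cca(X, Y, k, reg=1e-4):
        # Minimal linear CCA sketch (hypothetical helper): find projections
        # that maximally correlate image features X (n x dx) and text
        # features Y (n x dy) learned by the two encoders.
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        n = X.shape[0]
        Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
        Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
        Cxy = X.T @ Y / n
        # Whiten each view via Cholesky factors, then take the top-k singular
        # directions of the whitened cross-covariance: these give the
        # maximally correlated projection pairs.
        Wx_white = np.linalg.inv(np.linalg.cholesky(Cxx)).T
        Wy_white = np.linalg.inv(np.linalg.cholesky(Cyy)).T
        U, s, Vt = np.linalg.svd(Wx_white.T @ Cxy @ Wy_white)
        Wx = Wx_white @ U[:, :k]       # image-side projection
        Wy = Wy_white @ Vt.T[:, :k]    # text-side projection
        return Wx, Wy, s[:k]           # s[:k] are the canonical correlations

At retrieval time, image and text features would be projected with Wx and Wy and compared in the shared space, for example by cosine similarity; this last detail is a common choice in CCA-based matching rather than something stated in the abstract.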
(2) To address the difficulty of fine-grained cross-modal image-text matching, this paper studies fine-grained matching based on visual semantic reasoning. Existing work on image-text cross-modal matching mainly learns global feature representations of the two modalities, image and text, and embeds them into a common multimodal semantic space for cross-modal similarity learning. However, such coarse-grained matching may lose local details and semantics, so fine-grained image-text matching remains a challenging problem. To solve it, this paper proposes an improved Visual Semantic Reasoning model (VSR++), which builds on image-text coarse-grained matching and additionally uses a local cross-modal attention mechanism to model fine-grained region-word correspondence. To better exploit the complementary advantages of matching at different granularities, a simple and effective joint training strategy is introduced to balance their relative importance. Extensive experimental analysis shows that the proposed VSR++ method achieves leading performance on two benchmark image-text matching datasets.
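To make the local cross-modal attention and joint training ideas concrete, below is a minimal sketch in the spirit of stacked cross-modal attention, assuming precomputed region and word features; the temperature, the pooling choices, and the balancing weight alpha are illustrative assumptions rather than the exact VSR++ formulation.

    import torch
    import torch.nn.functional as F

    def region_word_similarity(regions, words, temperature=9.0):
        # Local cross-modal attention sketch (illustrative, not the exact
        # VSR++ formulation).
        # regions: (n_regions, d) image region features
        # words:   (n_words, d)  word features for one sentence
        r = F.normalize(regions, dim=-1)
        w = F.normalize(words, dim=-1)
        # For each word, attend over the image regions.
        attn = F.softmax(temperature * (w @ r.t()), dim=-1)  # (n_words, n_regions)
        attended = attn @ regions                             # (n_words, d)
        # Fine-grained score: average cosine similarity between each word
        # and its attended visual context.
        return F.cosine_similarity(words, attended, dim=-1).mean()

    def joint_similarity(sim_global, sim_local, alpha=0.5):
        # Joint coarse+fine score; the balancing weight alpha is an assumption.
        return alpha * sim_global + (1.0 - alpha) * sim_local

In training, sim_global would come from the coarse-grained global-embedding similarity, and the joint score would feed a ranking-style objective; this pairing is a common design in this line of work and is only assumed here, not specified by the abstract.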
Keywords/Search Tags:Image-text cross-modal matching, Recurrent canonical correlation analysis, Contextual attention mechanism, Visual semantic reasoning, Local cross-modal attention mechanism