Cross modal retrieval plays an important role in the field of multi-modal learning. Different kinds of data, such as images, audio, and text, have their own intrinsic attributes and are collectively called multi-modal data. Cross modal retrieval aims to explore the semantic relationships between different modalities. It connects data that carry the same semantic information but reside in different modalities, enabling retrieval across modalities according to this semantic connection. This paper focuses on retrieval between the image modality and the text modality. Current cross modal retrieval methods usually concentrate on fine-grained features within each modality, such as salient regions in images and words in sentences. They perform overall semantic structure matching between modalities by aligning the fine-grained features of the two modalities. However, these methods usually ignore the relationships among the fine-grained features and attend only to the entities that appear in both modalities, which may lead to semantic misalignment. In fact, aligning the fine-grained features only indicates that the two modalities share the same entities; the relationships among those entities, and thus the overall semantics, may still differ. For example, "a dog chasing a cat" and "a cat chasing a dog" contain the same entities but describe opposite relations. Therefore, it is necessary to model the relationships among the fine-grained features, which makes them more discriminative because the structure information is preserved. When performing cross modal retrieval, we should focus not only on the alignment of the fine-grained features but also on the relationships among them, which helps the network generalize better and identify complex local structure patterns. To address these problems, this paper focuses on reasoning about the relationships among the fine-grained features while preserving the local structure. Extensive experiments on two widely used datasets demonstrate that the proposed methods model the relationships among the fragments within each modality better and obtain better results. The
main contributions of this paper are as follows:

(1) A novel masked attention network for fine-grained cross modal retrieval is proposed. Because of the semantic gap between the image and text modalities, fine-grained cross modal retrieval methods usually represent the shared semantics as an attention-weighted combination of all the fragments (regions in an image or words in a text). The similarity between the query modality and the shared semantics is then computed, which serves as the semantic connection between the two modalities. Relevant fragments receive more attention and irrelevant ones receive less. Although the irrelevant fragments have little impact on the shared semantics, they can still cause semantic misalignment. To solve this problem, the proposed method constructs a mask based on the relationships among the fragments, which excludes the disturbance of the irrelevant fragments and reinforces the contribution of the relevant ones.

(2) The Transformer architecture is introduced into the similarity reasoning module. To capture the relationships among the fine-grained fragments precisely and to construct the local semantic structure of the whole image (or text), this paper proposes a Transformer-based similarity reasoning network for cross modal retrieval. The network performs reasoning on the similarities between fine-grained features from different modalities. Because the whole feature vector is divided into multiple blocks, the resulting similarity vector contains more detailed information. By passing these similarity vectors through Transformer layers, the network models the local structure within each modality, which helps it identify complex patterns and achieve higher accuracy.

(3) A two-stage relation-aware similarity reasoning network is proposed. The method is based on a graph neural network and a Transformer, which model the relationships within each modality more precisely and preserve the local structure. The first stage builds a semantic graph with a graph neural network: the fine-grained features in a modality are the nodes, and the relationships among the fragments are the edges, so each node learns relational information from its neighbors. The second stage employs a Transformer to reason about the similarities among the fine-grained features, which strengthens the relationships among them and boosts the performance of the network.
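The attention-then-mask scheme of contribution (1) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the masking criterion (keeping only regions whose best similarity to any word exceeds a threshold), the softmax temperature, and the final mean-cosine pooling are all illustrative assumptions.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """Normalize row vectors to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def masked_cross_attention(regions, words, mask_thresh=0.3, temperature=0.1):
    """Image-text similarity via word-to-region attention with a region mask.

    regions: (n_regions, d) image region features
    words:   (n_words, d)  text word features
    Returns a scalar similarity score.
    """
    V = l2norm(regions)                       # (n_r, d)
    W = l2norm(words)                         # (n_w, d)
    sim = W @ V.T                             # (n_w, n_r) cosine similarities

    # Hypothetical mask: drop regions that are irrelevant to every word,
    # so they cannot disturb the shared semantics.
    mask = sim.max(axis=0) > mask_thresh      # (n_r,)
    if not mask.any():                        # fall back to no masking
        mask = np.ones(V.shape[0], dtype=bool)

    logits = np.where(mask[None, :], sim / temperature, -np.inf)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    # Shared semantics: attention-weighted combination of region features.
    context = weights @ V                     # (n_w, d)

    # Pool per-word cosine similarities into one image-text score.
    return float(np.mean(np.sum(W * l2norm(context), axis=1)))
```

When the words closely match a subset of the regions, attention concentrates on those regions and the score approaches 1, while the mask keeps unmatched regions from diluting the shared semantics.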