In the information era, multi-modal data, as a medium for representing, storing, and transmitting information, is growing explosively. Retrieving information effectively from such massive multi-modal data is of great research significance and broad practical demand. Cross-modal retrieval aims to return relevant results in other modalities given a query item of one modality, for example, using text to retrieve images. Owing to its flexible and practical retrieval pattern, cross-modal retrieval has been widely applied in scenarios such as search engines and recommendation systems, and has become a research hotspot in the field of information retrieval.

The key issue in cross-modal retrieval is to effectively mine the relationships between data of different modalities and thereby learn more accurate cross-modal similarity. Mainstream methods focus on learning a common embedding space for data of different modalities to bridge the "heterogeneity gap" between them. Since mining complementary semantic information plays an important role in cross-modal similarity learning, this thesis takes it as the entry point for designing deep neural networks for cross-modal retrieval. Starting from the new research perspective of fully mining "inter-modality complementary semantic information" and "intra-modality complementary semantic information", this thesis learns cross-modal similarity, realizes semantic enhancement, and improves the accuracy of cross-modal retrieval. To mine these two types of complementary semantic information, two deep networks for cross-modal retrieval are proposed:

(1) To mine inter-modality complementary semantic information, a "multi-modal and multi-grained semantic enhancement network" is proposed. The network works in two stages to collect and fuse the rich complementary semantic information scattered at the global and local levels of different modalities, and effectively bridges the "heterogeneity gap" and the "granularity gap" to learn more accurate cross-modal similarity. In the first stage, the concepts of "primary modality" and "auxiliary modality" are introduced to define "primary similarity" and "auxiliary similarity", and a global-level subnetwork and a local-level subnetwork are constructed on coarse-grained and fine-grained features, respectively. In the second stage, a novel "multi-spring balance loss" is proposed: the samples most in need of optimization are selected to construct multi-spring balance systems, and the cross-modal similarity is then adaptively optimized by exploiting the potential interactions between samples. In each subnetwork, the multi-spring balance loss jointly optimizes the primary similarity and the auxiliary similarity, so that the valuable semantic knowledge contained in the auxiliary similarity is transferred to the primary similarity; this effectively captures inter-modality complementary semantic information and significantly improves cross-modal retrieval performance through semantic enhancement. Comparisons with multiple methods on several datasets, together with ablation experiments, demonstrate the effectiveness and superiority of the network.
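To make the second-stage optimization more concrete, the following is a minimal PyTorch-style sketch of one plausible spring-based formulation. The quadratic elastic-energy form, the top-k selection of the hardest negative "springs", the `alpha` weighting between primary and auxiliary similarities, and the assumption of one matched pair per query are illustrative choices, not the thesis's exact definitions.

```python
import torch

def spring_energy(sim, rest, stiffness=1.0):
    """Elastic potential energy of a 'spring' attached to a similarity value.

    The spring is at rest when the similarity equals `rest`; any deviation
    stores energy whose gradient acts as a restoring force on the similarity.
    """
    return 0.5 * stiffness * (sim - rest) ** 2

def multi_spring_balance_loss(primary_sim, auxiliary_sim, labels,
                              pos_rest=1.0, neg_rest=0.0, alpha=0.5, top_k=8):
    """Illustrative multi-spring balance loss (assumed formulation).

    primary_sim / auxiliary_sim: (N, N) cross-modal similarity matrices
        scaled to [0, 1]; labels: (N, N) binary matrix with exactly one
        matched pair per row. Only the hardest `top_k` negative springs per
        query (those storing the most energy, i.e., most in need of
        optimization) are kept in each balance system.
    """
    def system_energy(sim):
        pos_e = spring_energy(sim, pos_rest)[labels.bool()]
        neg_e = spring_energy(sim, neg_rest)[~labels.bool()].view(sim.size(0), -1)
        hard_neg_e, _ = neg_e.topk(min(top_k, neg_e.size(1)), dim=1)
        return pos_e.mean() + hard_neg_e.mean()

    # Jointly optimize the primary and auxiliary similarities so that
    # semantic knowledge from the auxiliary modality regularizes the primary one.
    return system_energy(primary_sim) + alpha * system_energy(auxiliary_sim)
```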
(2) To mine intra-modality complementary semantic information, a "context-aware multi-branch alignment network" is proposed. The network fully mines the contextual complementary semantic information within each modality to achieve precise alignment between modalities, by 1) comprehensively modeling and 2) inferring the valuable fine-grained intra-modality interactions. First, a "context-aware cell" is designed based on the self-attention mechanism and the gating mechanism; it suppresses useless interactions between fine-grained features and adaptively controls the internal information flow, thereby effectively modeling the intra-modality context. Then, to learn more comprehensive cross-modal associations, three alignment branches are designed to infer cross-modal similarity at different semantic levels: the cross-modal similarities of a given sample pair are learned by the "summary", "object", and "relationship" modules. Finally, these cross-modal similarities are effectively integrated by jointly optimizing the semantic consistency loss and the cross-modal alignment loss, which simultaneously achieves two goals: 1) effective complementarity among the different types of cross-modal similarities, and 2) precise alignment between samples of different modalities. Comparisons with state-of-the-art methods on the two benchmark datasets Flickr30K and MS-COCO demonstrate the effectiveness of the proposed solution, and ablation experiments further verify the contribution of each key module in the network.

To sum up, addressing the problems in previous research, this thesis proposes two networks for cross-modal retrieval that fully exploit "inter-modality complementary semantic information" and "intra-modality complementary semantic information" to achieve semantic enhancement, thereby significantly improving the accuracy of cross-modal retrieval.
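As an illustration of how the context-aware cell of the second network might be realized, the following is a minimal PyTorch sketch that combines self-attention with a sigmoid gate. The class name, the gating form (a linear layer over the concatenated input and attended context), and the gated residual update are assumptions made for illustration rather than the thesis's exact design.

```python
import torch
import torch.nn as nn

class ContextAwareCell(nn.Module):
    """Illustrative context-aware cell: self-attention plus a gate.

    Self-attention models interactions among fine-grained features
    (e.g., image regions or words); a sigmoid gate then decides how much
    of each attended context vector flows into the output, suppressing
    useless interactions.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)   # gate computed from input + context
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, num_features, dim) fine-grained features of one modality
        context, _ = self.attn(x, x, x)       # intra-modality interactions
        g = torch.sigmoid(self.gate(torch.cat([x, context], dim=-1)))
        return self.norm(x + g * context)     # gated residual update

# Example: contextualize 36 region features of dimension 512 per image.
cell = ContextAwareCell(dim=512)
regions = torch.randn(2, 36, 512)
contextualized = cell(regions)
```

In such a sketch, the cell would be applied separately to the fine-grained features of each modality (for example, image regions and sentence words) before the summary, object, and relationship branches infer their cross-modal similarities.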