With the advent of the big data era, multimodal data on the Internet is growing exponentially, and unimodal retrieval methods can no longer satisfy users' increasingly diverse retrieval needs. To help users manage and exploit massive multimodal data more efficiently, cross-modal retrieval has become a hot topic of current research. Cross-modal retrieval refers to using data of one modality to retrieve semantically related data of another modality. The biggest bottleneck it faces is the heterogeneity of multimodal data: the similarity between data of different modalities cannot be measured directly. The current solution is to map data of different modalities into a common subspace, where similarity can be measured to complete the retrieval. Depending on whether category labels are used as supervision, cross-modal retrieval methods can be divided into supervised and unsupervised approaches. Each has its own advantages and disadvantages, and this thesis focuses on problems that exist in both, as follows:

(1) Most supervised cross-modal retrieval methods consider only inter-modal similarity and have difficulty fully exploiting the semantic information contained in category labels. This thesis proposes a supervised cross-modal hashing method based on intra-modal similarity and semantic preservation. First, different measurement criteria are selected for different modalities to measure intra-modal and inter-modal similarity, respectively; the similarity information between data points is preserved by minimizing a loss function, which improves the model's retrieval accuracy. Next, the generated hash code is fed into a fully connected network to obtain a category prediction code whose length equals the number of category labels; a loss term drives this prediction code to be as similar as possible to the true category label, fully preserving the semantic information the labels carry. Finally, quantization and bit-balance constraints are imposed on the generated hash code to further improve its quality and thus the model's retrieval accuracy (a sketch of these loss terms is given below).

(2) Compared with supervised cross-modal retrieval, unsupervised cross-modal retrieval methods remove the model's dependence on category labels, saving considerable human and material resources. Contrastive learning, a common unsupervised learning technique, has achieved many results in the field of information retrieval, but it still suffers from inflexible sample selection and neglect of local correlation information. This thesis proposes an unsupervised cross-modal retrieval method based on global-local contrastive learning. A new scoring mechanism selects positive and negative samples flexibly (sketched below), preventing semantically similar data from being placed far apart in the common subspace. Furthermore, to further improve the model's discriminative performance, a variational autoencoder is introduced to generate positive anchor data for learning. Finally, to fully exploit the local correlation information between data of different modalities, a soft attention mechanism focuses on fine-grained correspondences between image regions and text words, and local semantic alignment is completed through contrastive learning (see the final sketch below) to improve the model's retrieval accuracy.
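To make the loss design in (1) concrete, the following is a minimal PyTorch sketch, not the thesis's actual implementation; the layer sizes, the multi-label BCE formulation, and the weights `alpha` and `beta` are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticPreservingHead(nn.Module):
    """Fully connected layer mapping a hash code to a category
    prediction code whose length equals the number of labels."""
    def __init__(self, code_len: int, num_labels: int):
        super().__init__()
        self.fc = nn.Linear(code_len, num_labels)

    def forward(self, hash_code: torch.Tensor) -> torch.Tensor:
        return self.fc(hash_code)

def hashing_loss(code: torch.Tensor, pred: torch.Tensor,
                 labels: torch.Tensor,
                 alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    """Hypothetical combination of the three constraints described
    above: label preservation, quantization, and bit balance."""
    # keep the category prediction close to the true (multi-)labels
    label_loss = F.binary_cross_entropy_with_logits(pred, labels.float())
    # quantization: push relaxed real-valued codes toward {-1, +1}
    quant_loss = (code.abs() - 1.0).pow(2).mean()
    # bit balance: each bit should be roughly zero-mean over the batch
    balance_loss = code.mean(dim=0).pow(2).sum()
    return label_loss + alpha * quant_loss + beta * balance_loss
```

In the full method, `code` would come from modality-specific encoders, and the intra- and inter-modal similarity-preservation terms would be added alongside these losses.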
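The scoring mechanism in (2) could look like the following InfoNCE-style sketch; the cosine-similarity score and the threshold `pos_thresh` are assumptions, standing in for whatever scoring function the thesis actually uses.

```python
import torch
import torch.nn.functional as F

def score_based_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                                 tau: float = 0.07,
                                 pos_thresh: float = 0.8) -> torch.Tensor:
    """Contrastive loss in which any cross-modal pair scoring above
    `pos_thresh` is treated as a positive, not only the co-occurring
    (diagonal) pair, so semantically similar data are not pushed
    far apart in the common subspace."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = img @ txt.t()                    # (B, B) cosine similarities
    pos_mask = sim.detach() >= pos_thresh  # score-selected positives
    pos_mask |= torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    log_prob = sim / tau - torch.logsumexp(sim / tau, dim=1, keepdim=True)
    # average log-likelihood over all selected positives per anchor
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```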
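Finally, the soft-attention local alignment between image regions and text words could resemble this sketch, in the spirit of stacked cross attention; the temperature `tau` and the mean-pooled matching score are illustrative choices, and the resulting score would feed a contrastive objective that pulls matched image-text pairs together.

```python
import torch
import torch.nn.functional as F

def local_alignment_score(regions: torch.Tensor, words: torch.Tensor,
                          tau: float = 0.1) -> torch.Tensor:
    """Soft attention from text words to image regions.
    regions: (R, D) region features; words: (W, D) word features.
    Each word attends over all regions, and the matching score is
    the mean similarity between words and their attended regions."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    attn = F.softmax(words @ regions.t() / tau, dim=-1)  # (W, R)
    attended = attn @ regions                  # (W, D) per-word context
    return F.cosine_similarity(words, attended, dim=-1).mean()
```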