With the increasing abundance of multimedia data resources, cross-modal retrieval research aimed at knowledge sharing across media data and accurate data mining has gradually emerged. Because different modalities differ in representation and data distribution, effectively measuring the similarity between modalities and achieving accurate matching has become the key to cross-modal retrieval. With the development of deep learning, the field has entered the era of pre-trained models. Multi-modal pre-trained models, jointly trained on data from multiple modalities, acquire rich cross-modal understanding and strong transfer ability. Based on the Contrastive Language-Image Pre-training (CLIP) model, we transfer it to the downstream task of cross-modal image-text retrieval. This paper studies the following issues in depth: the lack of information interaction between modalities during data processing, the limited ability of traditional category classification loss functions to constrain intra-modal and inter-modal classification, and the absence of a refined distinction for semantic differences within image-text pairs. The main research content is outlined as follows:

(1) A cross-modal retrieval model based on CLIP is constructed to maximize semantic relevance and modality alignment. First, feature extraction leverages the pre-trained model's existing image and text comprehension capabilities. Second, a modality alignment module based on decomposable attention is implemented to facilitate modality interaction within the same semantic categories. Third, a multi-layer perceptron with shared weights is used to reduce modality heterogeneity in the common representation space and maintain modality invariance. Finally, a constant angular margin penalty is applied between the feature vectors and the weight matrix through the Arc4cmr loss to enhance inter-class separation, thereby simultaneously increasing intra-class similarity and inter-class differences. A sketch of such an angular margin head is given after this abstract.

(2) A cross-modal retrieval model based on CLIP is proposed that combines semantic refinement discrimination with modality alignment and inference learning. A modality alignment module based on scaled dot-product attention strengthens the correlation of semantically related modal features and learns the alignment between the two modalities. A semantic approximate-matching and correct-matching module is designed to enhance the aggregation of intra-class image-text features and, at the same time, to refine the distinction between image-text pairs of the same class that differ in semantic detail. Mutually supervised contrastive losses strengthen fine-grained feature matching. A contrastive loss between the image-text feature similarity matrix and the inter-class label similarity matrix keeps the loss of approximate intra-class matching smaller than that of inter-class mismatching.

Based on pre-trained models, this paper proposes two cross-modal retrieval models and carries out extensive comparative and ablation experiments on public datasets to verify their effectiveness. The experimental results show that the proposed models achieve the best evaluation metric scores compared with current typical models on three datasets, namely Wikipedia, Pascal-Sentence, and NUS-WIDE, effectively improving the accuracy of cross-modal image-text retrieval.
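To make the angular margin penalty in contribution (1) concrete, the following is a minimal sketch of an ArcFace-style classification head applied to CLIP image and text embeddings. The class count, margin, and scale values are illustrative assumptions, not the exact Arc4cmr configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginHead(nn.Module):
    """ArcFace-style head: adds a constant angular margin between
    L2-normalized features and the class-weight matrix before softmax."""
    def __init__(self, embed_dim=512, num_classes=10, scale=30.0, margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, features, labels):
        # Cosine similarity between normalized features and class weights.
        cos_theta = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the ground-truth class.
        one_hot = F.one_hot(labels, cos_theta.size(1)).float()
        logits = self.scale * torch.cos(theta + self.margin * one_hot)
        return F.cross_entropy(logits, labels)

# Image and text embeddings (e.g. from CLIP) share the same head so that
# both modalities are pulled toward the same class-weight directions.
head = AngularMarginHead()
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 10, (8,))
loss = head(img_feat, labels) + head(txt_feat, labels)
```

Sharing one head across modalities is what makes the margin act on both intra-modal and inter-modal class boundaries at once.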
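Similarly, the scaled dot-product attention alignment and the label-similarity-guided contrastive loss in contribution (2) can be sketched as below; the temperature and the way class labels define the target similarity matrix are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def align_modalities(img_feat, txt_feat):
    """Scaled dot-product attention: each image feature attends over the
    text features in the batch (and vice versa) to inject cross-modal context."""
    d = img_feat.size(-1)
    attn_i2t = F.softmax(img_feat @ txt_feat.T / d ** 0.5, dim=-1)
    attn_t2i = F.softmax(txt_feat @ img_feat.T / d ** 0.5, dim=-1)
    return img_feat + attn_i2t @ txt_feat, txt_feat + attn_t2i @ img_feat

def label_guided_contrastive_loss(img_feat, txt_feat, labels, temperature=0.07):
    """Pull the image-text similarity matrix toward an inter-class label
    similarity matrix, so intra-class near-matches are penalized less than
    inter-class mismatches."""
    sim = F.normalize(img_feat) @ F.normalize(txt_feat).T / temperature
    # Target distribution: uniform over all text items sharing the image's label.
    label_sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    target = label_sim / label_sim.sum(dim=1, keepdim=True)
    return F.cross_entropy(sim, target)

img, txt = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 5, (8,))
img_a, txt_a = align_modalities(img, txt)
loss = label_guided_contrastive_loss(img_a, txt_a, labels)
```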