The rapid growth of multimodal data and the increasingly diverse, personalized needs of users have driven the development of cross-modal retrieval, which retrieves relevant data across different modalities such as text, image, video, and audio, and thus provides more intuitive and varied information services. Cross-modal retrieval is also a fundamental task in multimodal learning and understanding, since it involves feature extraction, semantic association, and relevance measurement across modalities, and it sheds light on the inherent connections and regularities among them. The field has therefore attracted many researchers.

This thesis addresses cross-modal retrieval, which aims to retrieve relevant information across modalities based on semantic similarity. The task poses several challenges: the semantic gap between heterogeneous modalities, which prevents direct comparison and similarity measurement; the complex and diverse association patterns among modalities, such as one-to-one, one-to-many, and many-to-many, which require effective cross-modal matching models to capture; and the trade-off between accuracy and efficiency, that is, how to improve retrieval speed and scale without compromising retrieval quality. To overcome these challenges, this thesis focuses on enhancing cross-modal representation learning: it extracts high-level semantic features from different modalities to bridge the semantic gap, handles many-to-many matching, and leverages interactive models to boost the retrieval accuracy of representation models. The thesis makes the following contributions:

(1) Entity-relationship consistency for cross-modal representation optimization. This thesis proposes an algorithm that tackles the semantic gap in cross-modal retrieval. The algorithm is based on the assumption that different modalities describing the same scene should exhibit consistent entity relationships. It imposes constraints on relationship consistency and extracts high-level semantic features so that different modalities with the same semantics lie closer in the feature space, which enhances the cross-modal representation ability of the feature encoder. The algorithm works as follows: first, it identifies the entities described by images and text and constructs a scene graph of their interactions; second, it constrains the consistency of the corresponding scene graphs, enabling the model to learn intra-modal entity interactions and to narrow the inter-modal semantic gap.
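As an illustration of how a relationship-consistency constraint could be combined with a standard cross-modal contrastive objective, the following sketch pairs per-triplet scene-graph embeddings from the two modalities. It is a minimal sketch under assumed tensor shapes, module names, and loss weighting, not the exact formulation used in this thesis.

```python
# Minimal sketch (illustrative assumptions, not the thesis's exact formulation):
# global image/text embeddings are trained with a contrastive loss, while
# per-triplet scene-graph embeddings from the two modalities are pulled together
# when they describe the same <subject, predicate, object> relation.

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over the matched image-text pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def relation_consistency_loss(img_rel_emb, txt_rel_emb):
    """Pull together embeddings of corresponding scene-graph triplets.

    img_rel_emb, txt_rel_emb: (B, R, D) tensors with R relation (triplet)
    embeddings per sample, assumed pre-aligned so that index r in both
    modalities refers to the same <subject, predicate, object> relation.
    """
    img_rel_emb = F.normalize(img_rel_emb, dim=-1)
    txt_rel_emb = F.normalize(txt_rel_emb, dim=-1)
    # 1 - cosine similarity, averaged over all relations and samples.
    return (1.0 - (img_rel_emb * txt_rel_emb).sum(-1)).mean()

def total_loss(img_emb, txt_emb, img_rel_emb, txt_rel_emb, lam=0.5):
    """Joint objective: global alignment plus relationship consistency."""
    return (contrastive_loss(img_emb, txt_emb)
            + lam * relation_consistency_loss(img_rel_emb, txt_rel_emb))

# Toy usage: random features stand in for encoder outputs.
B, R, D = 8, 5, 256
loss = total_loss(torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, R, D), torch.randn(B, R, D))
print(float(loss))
```

In this sketch the relation embeddings are assumed to be aligned by index; in practice the correspondence between image and text triplets would itself have to be established, for example by matching predicted entity and predicate labels.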
(2) Fine-grained cross-modal alignment based on unsupervised clustering. To tackle polysemy, ambiguity, and many-to-many mappings in cross-modal retrieval, this thesis proposes a fine-grained cross-modal alignment algorithm based on unsupervised clustering. For instance, a picture can be described differently depending on the region of interest, and the same textual meaning can be expressed in different ways; extracting a single vector to match different modalities can therefore introduce ambiguity and confusion and lower retrieval accuracy. The algorithm addresses this by decoupling the different attribute features contained in modal features through unsupervised clustering, and by unifying the attribute decoupling of different modal data within a single framework, making the attribute features comparable across modalities. It then performs matching judgments attribute by attribute for each candidate pair and eliminates redundant information, alleviating the impact of ambiguity and confusion.

(3) Representation learning based on cross-modal interaction information fusion. This thesis proposes a representation learning algorithm that integrates cross-modal interaction information for cross-modal retrieval. The algorithm addresses the performance degradation that representation learning models suffer from the lack of sufficient interaction between modalities. It unifies the interactive learning model and the representation learning model within a single framework and uses the interactive model to optimize the underlying feature extraction network. It then transfers the cross-modal interaction knowledge learned by the interactive model to the representation model through knowledge distillation, compensating for the missing interaction information and improving retrieval accuracy without adding extra computation.

In summary, this thesis presents novel and effective algorithms that address the three major challenges in cross-modal retrieval: the semantic gap between modalities, the complexity of cross-modal matching, and the trade-off between retrieval accuracy and speed. The algorithms are evaluated on two public datasets, MS COCO and Flickr30K, and show superior performance compared with existing methods.
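To make the knowledge-distillation step of contribution (3) concrete, the sketch below shows one common way a dual-encoder representation model could be supervised by the pairwise scores of an interactive matching model. All names, tensor shapes, and the temperature value are illustrative assumptions rather than the exact method used in this thesis.

```python
# Minimal sketch (assumptions, not the thesis's exact method) of distilling a
# slow interactive (cross-attention) matching model into a fast dual-encoder
# representation model: the teacher's pairwise image-text scores supervise the
# student's embedding-similarity distribution over each batch.

import torch
import torch.nn.functional as F

def distillation_loss(teacher_scores, student_img, student_txt, tau=2.0):
    """KL divergence between softened teacher and student score distributions.

    teacher_scores: (B, B) image-text matching scores from the interactive model.
    student_img, student_txt: (B, D) embeddings from the dual-encoder student.
    """
    student_scores = (F.normalize(student_img, dim=-1) @
                      F.normalize(student_txt, dim=-1).t())
    # Distill both retrieval directions (image-to-text and text-to-image).
    loss_i2t = F.kl_div(F.log_softmax(student_scores / tau, dim=1),
                        F.softmax(teacher_scores / tau, dim=1),
                        reduction="batchmean") * tau ** 2
    loss_t2i = F.kl_div(F.log_softmax(student_scores.t() / tau, dim=1),
                        F.softmax(teacher_scores.t() / tau, dim=1),
                        reduction="batchmean") * tau ** 2
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: random tensors stand in for teacher scores and student embeddings.
B, D = 8, 256
loss = distillation_loss(torch.randn(B, B), torch.randn(B, D), torch.randn(B, D))
print(float(loss))
```

Because the teacher only provides training signals, retrieval at test time would use the student's embeddings alone, which is what keeps the additional cost out of the inference path.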