
Cross-modal Retrieval Research Based On Reinforcement Learning

Posted on: 2024-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: H Yang
Full Text: PDF
GTID: 2568307079976549
Subject: Electronic information
Abstract/Summary:
Cross-modal retrieval refers to the process of retrieving information across different media types such as image, audio, and text. It has become one of the research hotspots in computer vision, natural language processing, and machine learning, and image-text retrieval in particular has made great progress and attracted growing attention in recent years. The core task of cross-modal retrieval is to accurately measure the similarity between multimodal data. In the interactive cross-modal image retrieval scenario, the heterogeneity of data across modalities and the unbalanced distribution of data among modalities pose many challenges to model construction. First, the traditional interaction scheme passively receives user feedback and then iteratively supplements incomplete information, which demands a large amount of feedback, consumes too much of the user's effort, and makes retrieval take too long. Second, when the user describes only some local regions of an image, the retrieval results usually fail to match because the information provided is incomplete. Third, because human-computer dialogue data is difficult to obtain and must be annotated by hand, fully supervised training is unrealistic. Finally, the mixed noise in multimedia data interferes with the robustness of metric learning algorithms, so improving the robustness of metric learning methods is also a major problem.

To address these problems, this paper proposes the following main innovations. First, it proposes a novel interactive cross-modal retrieval framework in which human-computer interaction proceeds via inquiry/confirmation. Because comprehensive human-computer dialogue data is difficult to obtain, making fully supervised training unrealistic, this paper adopts a weakly supervised training method that requires only an image-text dataset; this reduces the data-processing workload and saves substantial time. Second, it proposes a reinforcement learning strategy that enables the model to actively search for clearly distinguishable objects, identify the discriminative details missing from the current query, and supplement that missing information, instead of passively receiving it from user feedback; this greatly improves the retrieval performance of the model and is more practical than other dialogue-based retrieval models (a sketch of such a policy update follows this abstract). Third, building on the interactive cross-modal retrieval framework, the paper carries out an in-depth study of metric learning techniques and designs a maximum polynomial loss function that provides a robust metric loss for cross-modal retrieval tasks (a hypothetical sketch also follows). Experimental results show that this loss function significantly improves the convergence rate and retrieval efficiency of the cross-modal retrieval model.
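The abstract does not specify the exact reinforcement learning formulation, so the following is a minimal sketch, assuming a REINFORCE-style policy gradient in PyTorch in which the agent chooses which inquiry/confirmation question to pose and is rewarded by the retrieval improvement after the user's answer. All names (QuestionPolicy, reinforce_step, reward_fn) are hypothetical, not the thesis's actual implementation.

```python
import torch

class QuestionPolicy(torch.nn.Module):
    """Hypothetical policy network: scores candidate inquiry/confirmation
    actions given an embedding of the current dialogue state."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.head = torch.nn.Linear(state_dim, num_actions)

    def forward(self, state):
        # A categorical distribution over possible questions to ask.
        return torch.distributions.Categorical(logits=self.head(state))

def reinforce_step(policy, optimizer, state, reward_fn):
    """One REINFORCE update: sample a question, observe the retrieval
    improvement as reward, and ascend the policy gradient."""
    dist = policy(state)
    action = dist.sample()
    # Assumed reward: e.g., rank gain of the target image after the answer.
    reward = reward_fn(action)
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action.item(), reward

# Illustrative usage with stand-in dimensions and a dummy reward.
policy = QuestionPolicy(state_dim=512, num_actions=64)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
state = torch.randn(512)  # placeholder for a real dialogue-state encoding
reinforce_step(policy, optimizer, state, reward_fn=lambda a: 1.0)
```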
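The precise form of the proposed maximum polynomial loss is likewise not given here; the sketch below is a hedged reconstruction that combines hardest-negative ("max") mining, standard in image-text matching, with a polynomial expansion of the hinge violation (in the spirit of PolyLoss-style coefficient weighting). The margin and coefficients are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def max_polynomial_loss(img_emb, txt_emb, margin=0.2, coeffs=(1.0, 0.5)):
    """Hypothetical metric loss: hardest-negative mining plus a polynomial
    expansion of the hinge term. Not the thesis's exact formulation."""
    # Cosine similarity matrix between all image/text pairs in the batch.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()            # (B, B)
    pos = sim.diag().unsqueeze(1)          # matched pairs on the diagonal

    # Hinge violations against the hardest negative in each direction.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))
    viol_i2t = (margin + neg.max(dim=1).values.unsqueeze(1) - pos).clamp(min=0)
    viol_t2i = (margin + neg.max(dim=0).values.unsqueeze(1) - pos).clamp(min=0)

    # Polynomial expansion: sum_j a_j * violation^j over both directions.
    loss = 0.0
    for j, a in enumerate(coeffs, start=1):
        loss = loss + a * (viol_i2t.pow(j).mean() + viol_t2i.pow(j).mean())
    return loss

# Illustrative usage with random embeddings.
img = torch.randn(32, 256, requires_grad=True)
txt = torch.randn(32, 256, requires_grad=True)
max_polynomial_loss(img, txt).backward()
```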
Keywords/Search Tags:Interactive cross-modal retrieval framework, Reinforcement Learning, Max Polynomial Loss