
Research On Text-Image Cross Modal Retrieval Method

Posted on: 2022-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Hu
Full Text: PDF
GTID: 2558306488980969
Subject: Engineering
Abstract/Summary:
Text-image cross-modal retrieval operates on the two modalities of text and image: without relying on manual tags, it achieves accurate matching retrieval through the implicit semantic connections within the data. With the development of the Internet and social media, the amount of multimedia data online has exploded. Single-modal retrieval can no longer satisfy users, who increasingly want to retrieve the content they are interested in accurately from massive multi-modal data, so cross-modal retrieval has become a research hotspot in the field of information retrieval.

The core of the text-image cross-modal retrieval task is the semantic analysis and association matching of multi-modal data. Although deep learning has greatly improved the accuracy of cross-modal retrieval, the task still faces two major technical challenges: (1) a computer can effectively extract the low-level features of an image but has difficulty obtaining its high-level semantic information directly, so there is a "semantic gap" between low-level features and high-level semantics; (2) the underlying structures of text features and image features are heterogeneous, and the similarity between data of different modalities is difficult to measure directly, so there is a "heterogeneity difference" between modalities. To address these challenges, this thesis designs targeted solutions based on deep learning that alleviate the semantic gap, reduce the heterogeneity difference between modalities, and improve the performance of text-image cross-modal retrieval.

(1) To address the "semantic gap" between low-level features and high-level semantics, a text-image cross-modal retrieval method that fuses salient image features is proposed. Most current work focuses only on global image features or only on local features, whereas the fused saliency features proposed here incorporate object-level saliency features without losing the global image features. Salient features generally capture the entity objects or regions in an image; these are the most important elements of the image's high-level semantic information and are usually the focus of human attention. Salient features therefore bring effective information to text-image retrieval, improving the semantic expression of images and alleviating the semantic gap. The main work is as follows: (1) the extraction methods for global image features and text features are described in detail; (2) a text-image retrieval method that fuses salient image features is designed, as sketched below. The salient-feature extraction network has a simple structure but a clear effect, and it does not make the overall model too large. Experiments on two commonly used public datasets show that the proposed model is more robust in large-scale retrieval and more practical for special application scenarios such as surveillance retrieval.
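The abstract does not spell out the fusion operator, so the following is a minimal PyTorch sketch of one plausible reading: mean-pooled object-level salient-region features are concatenated with the global image feature and projected into a joint embedding space. All module names, dimensions, and the pooling choice are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class SalientFusion(nn.Module):
    """Fuse a global image feature with object-level salient-region
    features without discarding the global information.

    The mean-pool + concat + linear-projection operator is an assumption;
    the thesis states only that saliency features are fused while the
    global image feature is preserved.
    """

    def __init__(self, global_dim=2048, salient_dim=1024, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(global_dim + salient_dim, embed_dim)

    def forward(self, global_feat, salient_feats):
        # global_feat:   (B, global_dim), e.g. a pooled CNN feature
        # salient_feats: (B, K, salient_dim), K detected salient regions
        pooled = salient_feats.mean(dim=1)                # aggregate regions
        fused = torch.cat([global_feat, pooled], dim=-1)  # keep global info
        return self.proj(fused)                           # joint embedding
```

A saliency-detection backbone would supply salient_feats here; keeping the fusion to a single projection is consistent with the abstract's claim that the salient branch stays lightweight.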
(2) To address both the "semantic gap" and the "heterogeneity difference" between modalities, the second-stage model builds on the first. A self-attention mechanism is introduced in the text feature extraction stage to better learn the grammatical and semantic features within sentences, thereby distinguishing the importance of individual words and alleviating the semantic gap. A bimodal cross-gating module is then designed for the feature fusion stage, which reduces the heterogeneity difference between text and image and brings the query sample closer to the target sample. The main work is as follows: (1) a self-attention mechanism is used to learn the grammatical and semantic relevance between words, making the text's semantic information more accurate and alleviating the semantic gap (see the first sketch below); (2) two identical bimodal cross-gates generate features with smaller modal differences for images and texts, reducing the heterogeneity difference so that queries in either direction between the two modalities are supported (see the second sketch below). The two proposed improvements are evaluated on the same two commonly used public datasets. The final results show that both improvements raise the performance of text-image cross-modal retrieval, and that the overall model achieves results comparable to the latest research while being more robust in large-scale retrieval.
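The abstract names self-attention over sentence words but gives no head count or dimensions, so the following PyTorch sketch uses a standard multi-head scaled dot-product layer with illustrative hyperparameters.

```python
import torch
import torch.nn as nn

class TextSelfAttention(nn.Module):
    """Self-attention over word features so that important words receive
    higher weight when the sentence representation is built.

    embed_dim and num_heads are illustrative assumptions; the thesis
    specifies only that self-attention is applied during text feature
    extraction.
    """

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)

    def forward(self, words, pad_mask=None):
        # words:    (B, L, embed_dim) word features, e.g. from a BiGRU
        # pad_mask: (B, L) bool, True at padded positions
        out, _ = self.attn(words, words, words, key_padding_mask=pad_mask)
        return out  # contextualized word features
```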
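For the bimodal cross-gating, the abstract says only that two identical gates filter the image and text features. One common realization, shown here as an assumption rather than the thesis's verified design, computes a sigmoid gate from the opposite modality and multiplies it in element-wise.

```python
import torch
import torch.nn as nn

class BimodalCrossGate(nn.Module):
    """Each modality's embedding is gated by the other modality, pulling
    matched image-text pairs closer in the shared space.

    The sigmoid-over-a-linear-map gate form is an illustrative assumption.
    """

    def __init__(self, dim=512):
        super().__init__()
        self.gate_img = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_txt = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, img, txt):
        # img, txt: (B, dim) embeddings in the shared space
        img_gated = img * self.gate_img(txt)  # text gates the image
        txt_gated = txt * self.gate_txt(img)  # image gates the text
        return img_gated, txt_gated
```

Because the gating is symmetric across modalities, the same module serves image-to-text and text-to-image queries, matching the abstract's claim that retrieval in either direction is supported.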
Keywords/Search Tags:Cross-modal retrieval, Deep learning, Feature extraction, Similarity evaluation, Attention mechanism, Gating mechanism