
Research On Image-Text Retrieval Based On Multi-Branch Self-Attention Coding

Posted on: 2024-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: M J Zhang
Full Text: PDF
GTID: 2568307151467144
Subject: Communication Engineering (including broadband network, mobile communication, etc.) (Professional Degree)
Abstract/Summary:
With the development of Internet technology and the large-scale growth of data of all kinds, it has become difficult for people to retrieve the information they need efficiently and accurately. To retrieve useful information from such diverse and complex data, cross-modal retrieval has become a research hotspot in recent years. However, a heterogeneity gap exists between the underlying representations of multimodal data, which makes it impossible to measure cross-modal similarity directly. In addition, the volume of multimodal data is huge, and semantic differences exist between modalities. Mining invariant information across multimodal data and learning the underlying features have therefore become core difficulties in cross-modal retrieval. To address these issues, this thesis studies cross-modal retrieval models, with the following specific contributions:

First, to better learn the content similarity between multimodal data, an image-text retrieval network based on multi-branch self-attention coding is proposed. Structurally similar Bidirectional Encoder Representations from Transformers (BERT) and Vision Transformer (ViT) models extract text and image features respectively, so that the extracted features are easier to compare: images and texts of the same class are drawn close together, while those of different classes are pushed as far apart as possible. A sketch of this dual-branch design appears after this abstract.

Second, to bridge the heterogeneity gap between modalities while fully preserving cross-modal semantic information, a dual-adversarial image-text retrieval network incorporating a self-attention mechanism is proposed. A generator learns shared features, reconstructs shared representations, and generates pseudo-features for the corresponding modality; a discriminator then tries to distinguish the reconstructed features from the original ones. This generative-adversarial mechanism drives the reconstructed features ever closer to the originals, thereby better preserving the semantic information carried by the shared features.

Third, to learn discriminative features from multiple labels rich in semantic information, a multi-label image-text retrieval method incorporating an attention mechanism is proposed. The model uses Graph Attention Networks (GAT) to capture dependencies among the labels: through the GAT mapping function, interdependent classifiers are learned from input word embeddings, and these label classifiers then classify the generated common representations. Multi-label semantic similarity is used to describe inter-modal and intra-modal semantic correlations more accurately.
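As a concrete illustration of the first contribution, the following is a minimal sketch of a dual-branch encoder that pairs a BERT text branch with a ViT image branch, projects both into a shared embedding space, and trains with a margin-based triplet objective so that matching image-text pairs move together and mismatched pairs move apart. It assumes PyTorch and HuggingFace Transformers; the checkpoint names, projection dimension, and margin value are illustrative assumptions, not settings taken from the thesis.

```python
# Illustrative sketch only: the checkpoints, embed_dim, and margin below are
# assumed for demonstration, not the thesis's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, ViTModel

class DualBranchEncoder(nn.Module):
    """Two Transformer branches (BERT for text, ViT for images) mapped into
    one shared space so image-text similarity can be measured directly."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained(
            "google/vit-base-patch16-224-in21k")
        # Linear projections align both modalities in a common embedding space
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask, pixel_values):
        # Use each branch's [CLS] token as its global feature
        t = self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        v = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        # L2-normalize so cosine similarity reduces to a plain dot product
        return (F.normalize(self.text_proj(t), dim=-1),
                F.normalize(self.image_proj(v), dim=-1))

def triplet_loss(img, pos_txt, neg_txt, margin=0.2):
    """Pull matching image-text pairs together, push mismatched pairs apart."""
    pos_sim = (img * pos_txt).sum(-1)   # similarity to the matching caption
    neg_sim = (img * neg_txt).sum(-1)   # similarity to a mismatched caption
    return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()
```

In this formulation the loss reaches zero once every matching pair is at least `margin` more similar than its sampled negative, which realizes the "same class close, different class far" behavior described in the abstract.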
Keywords/Search Tags: image-text retrieval, BERT, ViT, self-attention mechanisms, generative adversarial, multi-label, GAT