
Deep Hash Image Retrieval Based On Vision Transformer

Posted on: 2024-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: C Huang
GTID: 2568307178973879
Subject: Computer Science and Technology

Abstract/Summary:
With the popularity of Internet technology and mobile devices, more and more images and videos are uploaded to the Internet. Faced with massive image data, retrieving images quickly and accurately becomes increasingly important. Image retrieval models based on deep hashing encode each image into a fixed-length hash code through hash learning, enabling fast retrieval and matching and addressing the high complexity and low efficiency of image retrieval in the era of big data (a minimal code sketch of this matching step is given after this abstract).

A deep hash retrieval algorithm has two key components: the feature extraction network and hash learning. The performance of the feature extraction network determines the representational power of the image features and the model's ability to capture semantic information, while the quality of the hash learning algorithm determines how discriminative the generated hash codes are, which in turn affects the accuracy of hash code matching. Previous deep hash retrieval models used convolutional neural networks, extracting local information through convolution and pooling and stacking ever deeper layers to obtain global long-range dependencies, which incurs high complexity and computational cost. In contrast, the Vision Transformer, built on self-attention, can effectively learn long-range dependencies in images and has shown excellent performance on a variety of image tasks. In view of these problems, this paper studies both key components of deep hash image retrieval:

1. We design an attention-enhanced Vision Transformer image retrieval network, AE-ViT. The Vision Transformer can effectively learn long-range dependencies of image features but cannot efficiently model local spatial features. To address this, AE-ViT includes an attention-enhanced module (AEM) that captures the locally salient information and visual details of the input feature map, learns corresponding weights to highlight important features, strengthens the representation of the features fed into the Transformer encoder, and improves the model's convergence speed (an illustrative sketch of such a module appears after this abstract). Experiments on two benchmark datasets with different hash code lengths compare AE-ViT, AlexNet, and ResNet as backbone networks. They verify the effectiveness and superiority of AE-ViT in image retrieval tasks and demonstrate the performance advantage of retrieval models built on the Vision Transformer architecture over those built on purely convolutional architectures.

2. Building on the proposed feature extraction network, we further design four image retrieval models based on classic deep hash losses and one based on a joint loss. On the one hand, comparative experiments verify the superiority of AE-ViT under different deep hash loss functions. On the other hand, to address the problem that classification label information is not fully exploited, we propose HSC-Loss, a contrastive loss combined with a classification loss (a hedged sketch of one possible formulation follows below). Experiments against various classic deep hash retrieval methods and Transformer-based hash retrieval methods verify the superiority of the Vision Transformer-based deep hash image retrieval algorithm proposed in this paper.
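To make the retrieval mechanism concrete, the following is a minimal sketch of hash-code matching: continuous network outputs are binarized by sign thresholding, and database images are ranked by Hamming distance to the query code. The 48-bit code length, the random features, and the function names are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def binarize(features: np.ndarray) -> np.ndarray:
    """Sign-threshold continuous network outputs into {0, 1} hash codes."""
    return (features > 0).astype(np.uint8)

def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Count differing bits between one query code and every database code."""
    return np.count_nonzero(db_codes != query_code, axis=1)

# Toy example: 48-bit codes for a database of 5 images and one query.
rng = np.random.default_rng(0)
db = binarize(rng.standard_normal((5, 48)))
query = binarize(rng.standard_normal(48))
ranking = np.argsort(hamming_distances(query, db))  # nearest codes first
print(ranking)
```

Because matching reduces to bit comparisons on short fixed-length codes, it stays fast even over very large databases, which is the efficiency argument the abstract makes.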
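The abstract names the AEM but does not specify its internal architecture, so the sketch below is only a plausible reading: a lightweight channel- and spatial-attention gate (in PyTorch) that reweights the input feature map before it is patch-embedded and passed to the Transformer encoder. The layer layout, the reduction ratio, and the 7x7 spatial kernel are all assumptions.

```python
import torch
import torch.nn as nn

class AttentionEnhancedModule(nn.Module):
    """Hypothetical AEM: gates a feature map with channel and spatial
    attention so salient local details are emphasized before the
    features are patch-embedded for the Transformer encoder."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze spatial dims, learn per-channel weights.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single conv highlights salient locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # reweight informative channels
        x = x * self.spatial_gate(x)   # reweight salient spatial positions
        return x

# Example: enhance a 64-channel feature map before patch embedding.
aem = AttentionEnhancedModule(channels=64)
print(aem(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```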
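HSC-Loss is described only as a contrastive loss combined with a classification loss, so the following is a hypothetical formulation consistent with that description: a pairwise contrastive term on relaxed (pre-binarization) codes plus a cross-entropy term on class logits, balanced by a weight alpha. The margin, the weighting, and the exact pairing scheme are assumptions rather than the thesis's definition.

```python
import torch
import torch.nn.functional as F

def hsc_loss(codes, logits, labels, margin=2.0, alpha=0.5):
    """Hypothetical HSC-Loss: a pairwise contrastive term on relaxed hash
    codes plus a classification term that uses the class labels directly."""
    # Pairwise ground truth: 1 where two samples share a class label.
    sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    dist = torch.cdist(codes, codes)  # Euclidean distances between codes
    contrastive = sim * dist.pow(2) + (1 - sim) * F.relu(margin - dist).pow(2)
    mask = 1.0 - torch.eye(codes.size(0), device=codes.device)  # drop self-pairs
    contrastive = (contrastive * mask).sum() / mask.sum()
    classification = F.cross_entropy(logits, labels)  # exploits label information
    return contrastive + alpha * classification

# Toy batch: 8 samples, 48-bit relaxed codes, 10 classes.
codes = torch.tanh(torch.randn(8, 48))
logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(hsc_loss(codes, logits, labels))
```

The classification term is what makes explicit use of the label information that, per the abstract, purely contrastive hash losses leave underexploited.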
Keywords/Search Tags:Image Retrieval, Vision Transformer, Deep Hashing, Attention Module