
Research On Image Classification Algorithm Based On Vision Transformer

Posted on: 2024-05-21    Degree: Master    Type: Thesis
Country: China    Candidate: Z F Zhu    Full Text: PDF
GTID: 2568307067973509    Subject: Computer technology
Abstract/Summary:
The essence of the image classification task is to distinguish images of different categories according to their semantic information. It is a core problem in computer vision and the basis of higher-level visual tasks such as object detection, image segmentation, and object tracking. In recent years, Transformer models based on the self-attention mechanism have shown strong competitiveness compared with convolutional neural networks, but existing Vision Transformer models still have shortcomings in image classification tasks. To further improve the performance of the Vision Transformer in image classification, this thesis studies the following two aspects.

First, this thesis proposes FFT (Feature Fusion Transformer), a multi-scale feature-fusion image recognition framework based on the Vision Transformer. The original Vision Transformer does not make full use of the internal structural information of a single image patch, which makes it difficult for the model to learn the local, fine-grained features of high-resolution images. To make up for this deficiency, this thesis extracts the feature maps output by different convolutional layers of a convolutional neural network and designs an FFT Block that merges the embedding vectors of different feature layers under the same receptive field, fusing image feature information at different scales. Experiments show that the framework attends to finer details in the image, provides richer feature information for image recognition, and effectively improves recognition accuracy: Top-1 accuracy on the Tiny ImageNet, CIFAR-10, and CIFAR-100 datasets improves over the baseline model by 6.5%, 3.7%, and 7.8%, respectively.

Second, this thesis proposes an Efficient Self-Attention (ESA) module that reduces computational complexity and an LE (Locally Enhanced) module for local enhancement. Because of dot-product attention, the computational complexity of the Vision Transformer is very high, which greatly slows down model inference; this computational and time complexity has become a bottleneck in the model's development. To reduce it, the ESA module ranks the attention intensity between the class token and the patch tokens and computes attention only for the tokens with higher intensity; in addition, a cross-layer reuse of the attention matrix further reduces the model's computational cost. A parallel convolutional LE module is also added: convolution captures local information which, combined with the global information captured by self-attention, provides more feature information for image recognition. Experiments show that on the Tiny ImageNet dataset, combining the ESA and LE modules yields a 1.2% improvement in Top-1 accuracy over the baseline model and a 19% improvement in inference speed.
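The fusion step of the FFT Block can be illustrated with a minimal sketch. The abstract does not give the implementation, so the 2x spatial ratio between feature maps, the average pooling used for alignment, and the channel-wise concatenation below are all assumptions; the point is only that tokens from two feature maps covering the same receptive field are merged into one multi-scale embedding.

```python
import numpy as np

def fuse_same_receptive_field(fine, coarse):
    """Hypothetical sketch of multi-scale fusion under a shared
    receptive field.

    fine:   (C_f, 2H, 2W) feature map from an early conv layer
    coarse: (C_c, H, W)   feature map from a deeper conv layer
    Each coarse cell covers the same image region as a 2x2 block of
    fine cells, so the fine map is average-pooled to (C_f, H, W) and
    concatenated channel-wise with the coarse map.
    """
    Cf, Hf, Wf = fine.shape
    Cc, H, W = coarse.shape
    assert Hf == 2 * H and Wf == 2 * W, "assumed 2x scale ratio"
    # Average-pool each 2x2 block of fine cells down to one cell.
    pooled = fine.reshape(Cf, H, 2, W, 2).mean(axis=(2, 4))
    # One fused vector per receptive field, mixing both scales.
    return np.concatenate([pooled, coarse], axis=0)  # (C_f + C_c, H, W)
```

In a full model, each fused (C_f + C_c)-dimensional vector would then be linearly projected to the Transformer's embedding width before entering the encoder.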
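The token-pruning idea behind the ESA module can be sketched as follows. The abstract does not specify the ranking criterion or the pruning rate, so using the class token's attention weights to score patch tokens, and a fixed `keep_ratio`, are assumptions here; the sketch shows how attending only to the higher-intensity tokens shrinks the attention matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def esa(q, k, v, keep_ratio=0.5):
    """Hypothetical sketch of attention-intensity token pruning.

    q, k, v: (n, d) arrays; row 0 is the class token, rows 1..n-1 are
    patch tokens. Patch tokens are ranked by the class token's
    attention weight to them, and only the top keep_ratio fraction
    takes part in the full attention computation.
    """
    d = q.shape[-1]
    # Class-token attention over all tokens serves as the importance score.
    cls_scores = softmax(q[0] @ k.T / np.sqrt(d))
    n_patches = q.shape[0] - 1
    n_keep = max(1, int(n_patches * keep_ratio))
    # Indices of the highest-scoring patch tokens (offset past the class token).
    top = np.argsort(cls_scores[1:])[::-1][:n_keep] + 1
    keep = np.concatenate(([0], np.sort(top)))
    # Full attention, but only over the retained tokens.
    qs, ks, vs = q[keep], k[keep], v[keep]
    attn = softmax(qs @ ks.T / np.sqrt(d))
    return attn @ vs, keep
```

With `keep_ratio=0.5` the attention matrix shrinks from n x n to roughly (n/2) x (n/2), i.e. about a 4x reduction in the dot-product cost, which is consistent in spirit with the reported speedup; the cross-layer attention-matrix reuse described above would further amortize this cost across adjacent layers.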
Keywords/Search Tags:Image Recognition, Vision Transformer, Multi-scale Feature Fusion, Efficient Self-attention, Locally Enhanced