With the continuous development of deep learning, major breakthroughs have been made in computer vision (CV): new algorithmic models are introduced constantly, and performance benchmarks improve rapidly. Image classification is a fundamental task in computer vision, and it too has seen important advances in recent years. Unlike common image classification, fine-grained image recognition (FGIR) aims to determine the specific sub-category of the target subject. It is characterized by large intra-class variance and small inter-class variance, which makes the classification problem harder. The mainstream approach in FGIR is therefore to locate representative local regions in an image and extract rich feature information from them, thereby improving classification performance. In recent years, with the introduction of convolutional neural networks (CNNs) and Transformer-based network models, state-of-the-art classification accuracy has been repeatedly refreshed on various benchmark datasets, with Transformer networks generally outperforming CNN-based work.

This paper explores the application of Transformers to FGIR tasks and proposes two network models based on the Transformer structure. Both guide the network to attend to highly discriminative local features, improving its ability to locate discriminative local regions and extract rich feature information, and thereby its applicability to FGIR tasks. Experiments on multiple public benchmark datasets yield strong classification accuracy, demonstrating the effectiveness of the methods. The innovations of this paper are as follows:

(1) This paper proposes an attention-grouping algorithm based on the Transformer (GA-Trans) that extracts highly discriminative features and secondary features for fine-grained image classification. ViT has the defect of over-focusing on global information and failing to make full use of highly discriminative local features. In this work, the attention weight of each image patch in the original image is computed by multiplying the self-attention matrices of successive Transformer layers; patches with high weights are retained, and the adjacency relations among the retained patches determine how the enhanced image is cropped (a sketch of this weighting scheme is given after this abstract). On this basis, the model can perform targeted learning on both highly discriminative features and secondary features. Experiments demonstrate that GA-Trans achieves the best classification accuracy on three public fine-grained benchmark datasets.

(2) This paper proposes a fine-grained classification algorithm based on Transformer multi-scale feature fusion. The key tasks of FGIR are localizing local details and extracting their features, so the granularity of the input image patches strongly affects model performance; prior work suffers from input patches that are too coarse. To improve the learning of fine image details, this work adopts the Swin Transformer (Swin-T), which supports near-pixel-level patch input, as the backbone network. Multi-scale features are screened, and the effective ones are fused to improve the quality of the final classification feature (see the fusion sketch below). Experiments show that the algorithm improves the model's ability to locate and extract subtle differences, and achieves the best classification accuracy on multiple datasets.
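The abstract describes the GA-Trans patch weighting only at a high level; the following is a minimal, hypothetical PyTorch sketch of the described scheme (multiplying the self-attention matrices of successive layers, in the spirit of attention rollout). The function names attention_rollout and keep_high_weight_patches, the [CLS]-token convention, and the keep_ratio parameter are illustrative assumptions, not the paper's implementation.

import torch

def attention_rollout(attentions):
    # attentions: list of per-layer self-attention tensors, each (B, heads, T, T),
    # where token 0 is the [CLS] token and tokens 1..T-1 are image patches.
    rollout = None
    for attn in attentions:
        a = attn.mean(dim=1)                            # average over heads -> (B, T, T)
        a = a + torch.eye(a.size(-1), device=a.device)  # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)             # re-normalize rows
        rollout = a if rollout is None else torch.bmm(a, rollout)  # compose layers
    # The [CLS] row gives each patch's aggregate contribution -> (B, T-1).
    return rollout[:, 0, 1:]

def keep_high_weight_patches(weights, keep_ratio=0.3):
    # Retain the indices of the highest-weight patches.
    k = max(1, int(weights.size(-1) * keep_ratio))
    return weights.topk(k, dim=-1).indices

Given the retained patch indices, the described method uses their adjacency on the patch grid to define the crop region of the enhanced image; that grouping step depends on the grid geometry and is omitted here.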
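Contribution (2) likewise does not specify how the multi-scale features are screened and fused; the sketch below shows one plausible gated-fusion head over the token maps of the four Swin Transformer stages. The class name MultiScaleFusionHead, the stage widths (96/192/384/768, as in Swin-T), the gating mechanism, and num_classes are illustrative assumptions, not the paper's design.

import torch
import torch.nn as nn

class MultiScaleFusionHead(nn.Module):
    # Hypothetical fusion head: project features from each Swin stage to a
    # common width, score each stage's usefulness with a learned gate, and
    # fuse the weighted features into a single classification vector.
    def __init__(self, stage_dims=(96, 192, 384, 768), fused_dim=512, num_classes=200):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, fused_dim) for d in stage_dims])
        self.gates = nn.ModuleList([nn.Linear(fused_dim, 1) for _ in stage_dims])
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, stage_feats):
        # stage_feats: list of (B, N_i, C_i) token maps, one per Swin stage.
        pooled = [p(f.mean(dim=1)) for p, f in zip(self.projs, stage_feats)]   # (B, fused_dim) each
        scores = torch.stack([g(v) for g, v in zip(self.gates, pooled)], dim=1)  # (B, S, 1)
        weights = scores.softmax(dim=1)                       # screen: soft per-stage weights
        fused = (torch.stack(pooled, dim=1) * weights).sum(dim=1)  # (B, fused_dim)
        return self.classifier(fused)

The softmax gate here plays the role of "screening": stages whose pooled features are judged uninformative receive low weight before fusion. The paper's actual screening criterion may differ.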