| Clothing is one of the most basic human needs and occupies an important position in daily life, and the categories and quantities of clothing are growing explosively. With the rapid development of e-commerce in recent years, classifying large numbers of clothing images accurately and quickly has become an urgent problem. Given the massive volume of data, manual labeling is time-consuming and labor-intensive. Traditional clothing classification methods based on machine learning, however, depend heavily on feature selection and are easily affected by human body posture, illumination, deformation, complex backgrounds and other factors, so their classification accuracy is relatively low. Therefore, how to use deep learning methods to classify clothing images efficiently is an important topic in the field of computer vision. In this paper, we propose an improved clothing classification algorithm based on an enhanced fusion model, local labeling and effective region selection. It addresses three problems: convolutional classification networks cannot capture long-distance dependencies within clothing images, ViT-based models do not fully utilize local information, and complex backgrounds introduce noise into the classification task. The research content of the thesis mainly includes: (1) Because existing CNN-based clothing classification methods cannot capture long-distance dependencies due to the limited receptive field of the convolution kernel, a hybrid model based on ResNet and ViT is proposed to obtain both global and local features. ResNet is used to extract shallow local features, and an efficient channel attention module (ECAM) is added to its residual block structure so that significant features are enhanced. The feature maps with attention weights are then sent to the ViT to establish the long-distance dependencies of clothing images, which effectively makes up 
for the disadvantage of the convolutional structure. (2) To address the problem that ViT-based models fail to make full use of the local features generated in the fusion process and thus waste information, a local labeling module (Token Labeling Module, TLM) is proposed. The TLM transforms the recognition problem of clothing images into the recognition problem of multiple sub-regions, uses a strong classification network to generate soft labels for each sub-region, and achieves dense supervision of the training model by improving the loss function. This method effectively utilizes the rich information contained in the local regions generated by the ViT network, and classifies clothing images by extracting the semantic information they contain. (3) To address the problem that cluttered backgrounds in clothing images mix noise into the feature maps as they are passed through the network, an Effective Region Selective Module (ERAM) is proposed. The ERAM uses the self-attention mechanism to obtain the weight of each region. By discarding the region with the smallest weight, it effectively reduces the influence on the hybrid classification model of the image background and human torso, which are unrelated to the clothing category. In this paper, the effectiveness of the proposed classification algorithm is verified on the DeepFashion dataset. The experimental results show that the proposed clothing classification method based on the attention mechanism and token labeling can effectively improve the accuracy of clothing classification. The accuracy of clothing image recognition based on the hybrid model with local labeling reaches 87.49%, which is 1.13% higher than that of the baseline network. The accuracy of the clothing classification method based on channel attention and effective region selection reaches 87.55%, which is 0.7% higher than that of the baseline network. The experimental results show that the method proposed in this paper can effectively 
extract the local and global semantic information contained in clothing images, and at the same time reduce, to a certain extent, the influence of background and other irrelevant regions on clothing classification accuracy. |
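
The two mechanisms described in (2) and (3) can be illustrated with a minimal, dependency-free sketch. It is not the thesis implementation. The abstract does not give the form of the improved loss, so the sketch assumes an image-level cross-entropy plus an averaged per-region cross-entropy against soft labels, weighted by a hypothetical factor `beta`. It also assumes that the per-region attention weights are already available as a list of numbers. All function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between predicted logits and a soft (probabilistic) label."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

def token_labeling_loss(cls_logits, image_label, token_logits,
                        token_soft_labels, beta=0.5):
    """Image-level loss plus the mean per-region loss (beta is an assumed weight)."""
    cls_loss = soft_cross_entropy(cls_logits, image_label)
    region_losses = [
        soft_cross_entropy(logits, soft_label)
        for logits, soft_label in zip(token_logits, token_soft_labels)
    ]
    return cls_loss + beta * sum(region_losses) / len(region_losses)

def drop_lowest_weight_region(tokens, weights):
    """Discard the single region whose attention weight is smallest."""
    drop = min(range(len(weights)), key=weights.__getitem__)
    return [tok for i, tok in enumerate(tokens) if i != drop]

# Toy example: four regions with assumed attention weights.
regions = ["background", "sleeve", "collar", "torso"]
attention = [0.05, 0.40, 0.35, 0.20]
kept = drop_lowest_weight_region(regions, attention)  # "background" is removed

# Toy example: two classes, two sub-regions with soft labels.
loss = token_labeling_loss(
    cls_logits=[2.0, 0.5],
    image_label=[1.0, 0.0],
    token_logits=[[1.5, 0.2], [0.1, 0.3]],
    token_soft_labels=[[0.9, 0.1], [0.5, 0.5]],
)
```

In a real model the region tokens would be ViT patch embeddings and the soft labels would come from the strong classification network mentioned above. Lists of numbers stand in for both here so that the arithmetic stays easy to follow.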