| Traditional image classification usually requires large amounts of labeled data to train models, but in practice data collection and labeling are very difficult. Zero-shot learning, which recognizes objects for which no training samples are available, has therefore become an active research topic. It addresses classification in the absence of samples by using class-level semantic information to connect seen and unseen classes, and thereby recognizes the unseen classes. Most existing zero-shot learning algorithms extract features with deep convolutional networks pre-trained on ImageNet, which ignores the distribution mismatch between ImageNet and the zero-shot learning benchmark datasets. To address this problem, this thesis uses Swin Transformer as a new backbone network, applying it to zero-shot learning for the first time. The backbone takes the original images as input and uses a self-attention mechanism to obtain visual features based on semantic information. On this basis, the thesis proposes two embedding-based zero-shot learning algorithms. The main research work is as follows: (1) An embedding-based zero-shot learning algorithm based on multi-label semantic guidance is proposed. While constructing the embedding space of visual features and semantic information on the seen classes, the algorithm also calculates the similarity between the semantic spaces of the seen and unseen classes, which guides the model to consider the unseen classes that are semantically similar to the current seen class. It then transfers the similarity from the semantic space to the embedding space, where classification is finally performed. This alleviates the domain shift problem and thus yields more accurate classification. (2) An embedding-based zero-shot learning algorithm based on multi-scale feature fusion is proposed. Rich attribute features are extracted from the images using the hierarchical structure of Swin Transformer, and then the attribute features are
aligned with the attribute prototypes to optimize the whole network, so that the global features contain more detailed information for distinguishing fine-grained semantic attributes. This alleviates the insufficient ability to represent detail that earlier methods suffer from when they rely on deep image features. Experiments validating the two proposed algorithms are conducted on the bird dataset (CUB), the scene dataset (SUN), and the animal dataset (AWA2), and the results show that both algorithms achieve good zero-shot classification results. |
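
The abstract does not give implementation details, so the following is only a minimal sketch of the general embedding-based idea it describes, not the thesis's actual method. It assumes a linear projection `W` from visual features into the attribute space, cosine similarity as the compatibility score, and a softmax over semantic similarities as one possible form of "semantic guidance". All names (`project`, `classify`, `soft_labels`) and the toy data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(W, x):
    """Map a visual feature x into the attribute space (one row of W per attribute)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def classify(W, x, class_attributes):
    """Pick the class whose attribute vector is most similar to the projected feature."""
    z = project(W, x)
    return max(class_attributes, key=lambda name: cosine(z, class_attributes[name]))

def soft_labels(seen_attr, unseen_attributes):
    """Softmax over the semantic similarity of one seen class to every unseen class."""
    sims = {name: cosine(seen_attr, a) for name, a in unseen_attributes.items()}
    top = max(sims.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - top) for name, s in sims.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Toy data (assumed): 3 attributes (striped, has_wings, aquatic), 2-D visual features.
unseen = {"zebra": [1.0, 0.0, 0.0], "penguin": [0.0, 1.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # hypothetical learned projection
x = [0.9, 0.1]                            # hypothetical feature of a test image

predicted = classify(W, x, unseen)
guidance = soft_labels([1.0, 0.0, 0.2], unseen)  # a seen class with horse-like attributes
```

In this sketch, `guidance` gives more weight to the unseen class that is semantically closer to the seen class. A training loss could use such weights as soft targets alongside the seen-class label, which is one plausible reading of transferring semantic-space similarity into the embedding space.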