With the rapid development of deep learning, image classification has made great progress. However, deep learning is a data-dependent paradigm that requires large amounts of human-labeled data to train models well. In the real world, manually annotating tens of thousands of categories is laborious and time-consuming, and for some rare categories it is difficult to collect enough training samples, or even any at all. Zero-shot learning (ZSL) has therefore attracted growing attention. Its goal is to classify images when the amount of training data for the target classes is zero, a problem also known as zero-shot image classification. Zero-shot learning imitates human reasoning: it uses auxiliary semantic information to build a bridge between seen and unseen classes, which helps construct a cross-modal mapping between visual and semantic information in an embedding space. It thereby achieves semantic alignment of cross-modal information and classifies samples from unseen classes. However, because visual and semantic information belong to different modalities, they face a semantic gap, and it is difficult to establish good cross-modal semantic alignment in the embedding space. This thesis therefore constructs an active attention mechanism and a contrastive constraint method to achieve cross-modal semantic alignment in the embedding space. The main research contents of this thesis are as follows.

First, this thesis proposes a hybrid routing transformer for zero-shot learning. Existing attention-based zero-shot learning methods ignore the semantic gap caused by the different modalities of visual and semantic information: they use passive attention mechanisms to weight the two modalities without narrowing the modality difference, so they can neither capture the truly attribute-relevant visual regions nor perform true cross-modal semantic alignment in the embedding space. To this end, this thesis proposes an active attention mechanism that combines top-down and bottom-up dynamic routing between two capsule networks, achieving active guidance by semantic features and active learning of visual features in the embedding space. This yields better semantically aligned visual features, reduces the modality difference between visual and semantic information, and finally establishes an effective cross-modal semantic alignment. In addition, this thesis is among the early works to build a transformer-based zero-shot image classification framework, formulated as an encoder-decoder, which provides an important reference for future research. Experiments show that the proposed method effectively alleviates the modality difference, achieves semantic alignment of cross-modal information, and largely improves the performance of zero-shot image classification.

Second, this thesis proposes a contrastive constraint embedding method for zero-shot learning. Existing embedding-based zero-shot learning methods neglect the semantic attributes themselves: they lack discriminative and robust representations of semantic attributes over visual features, and thus cannot achieve effective cross-modal semantic alignment. To this end, this thesis introduces contrastive learning into the embedding methods. The predicted semantic attributes of different samples from the same class are made as similar as possible, while those of different classes are pushed as far apart as possible, which improves the discriminative and robust expression of semantic attribute features over visual features in zero-shot image classification. In addition, by introducing the mean teacher mechanism, this thesis constructs a consistency constraint between a student model and a teacher model that receive differently augmented inputs, which further improves the robustness of the cross-modal mapping between visual and semantic information in the embedding space. Experiments show that the proposed method not only effectively improves attribute feature expression over visual features but also establishes good cross-modal semantic alignment, and it greatly improves the accuracy of zero-shot image classification. The model achieves state-of-the-art performance on the task of zero-shot image classification.
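The dynamic-routing step behind the active attention mechanism can be illustrated with routing-by-agreement between two capsule layers: lower capsules send prediction vectors to higher capsules, and coupling coefficients are iteratively sharpened toward the outputs they agree with. The following is a minimal NumPy sketch of standard routing-by-agreement (after Sabour et al.'s capsule routing), not the thesis's exact hybrid routing transformer; the shapes, iteration count, and `squash` nonlinearity are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement.

    u_hat: (n_in, n_out, d) prediction vectors from lower capsules.
    Returns the (n_out, d) output capsule vectors.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                 # routing logits
    for _ in range(n_iters):
        c = softmax(b, axis=1)                  # coupling coefficients per input capsule
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum -> (n_out, d)
        norm = np.linalg.norm(s, axis=-1, keepdims=True)
        v = (norm**2 / (1.0 + norm**2)) * s / (norm + 1e-8)  # squash to length < 1
        b = b + (u_hat * v[None]).sum(axis=-1)  # reward agreement with outputs
    return v
```

Iterating the agreement update lets the semantic (higher) capsules actively select which visual (lower) capsules contribute, which is the "active" flavor of attention described above.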
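The contrastive constraint on predicted attributes can be sketched with a supervised contrastive loss: predicted attribute vectors of same-class samples act as positives and are pulled together, while those of different classes act as negatives and are pushed apart. This NumPy sketch assumes a standard InfoNCE-style formulation with temperature `tau`; the thesis's exact loss may differ.

```python
import numpy as np

def sup_contrastive_loss(z, labels, tau=0.1):
    """Supervised contrastive loss on predicted attribute embeddings.

    z: (N, d) predicted attribute vectors; labels: (N,) class ids.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity space
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]             # same-class mask
    np.fill_diagonal(pos, False)
    n_pos = pos.sum(axis=1)
    loss_per = -np.where(pos, log_prob, 0.0).sum(axis=1) # sum log-prob over positives
    valid = n_pos > 0                                    # anchors with >=1 positive
    return (loss_per[valid] / n_pos[valid]).mean()
```

Minimizing this loss makes same-class attribute predictions cluster and different-class predictions separate, which is the discriminative/robust attribute expression the second contribution aims for.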
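The mean teacher consistency constraint pairs a student model, updated by gradient descent, with a teacher model whose weights are an exponential moving average (EMA) of the student's, and penalizes disagreement between their predictions on differently augmented views of the same input. A minimal NumPy sketch following Tarvainen and Valpola's mean teacher; the decay `alpha` and the mean-squared consistency loss are the usual choices and are assumed here:

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean teacher rule: teacher weights <- EMA of student weights."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

def consistency_loss(student_out, teacher_out):
    """Mean-squared disagreement between the two models' predictions
    on two differently augmented views of the same batch."""
    return float(np.mean((student_out - teacher_out) ** 2))
```

Because the teacher averages the student over time, its predictions are more stable, and matching the student to them regularizes the cross-modal visual-to-attribute mapping.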