In visual perception, meaningful concepts are derived from raw pixels by aggregating visual units that form specific patterns, enabling perception of the whole. Image segmentation, the process of grouping the pixels of an image into non-overlapping regions, is a challenging problem in computer vision and serves as the foundation for numerous visual tasks. It has a wide range of applications in fields such as robotic grasping, photo editing, autonomous driving, medical diagnostics, fingerprint recognition, and remote sensing mapping. Early image segmentation algorithms relied primarily on Gestalt principles, such as similarity and proximity, to perform category-agnostic segmentation with little or no annotation. With advances in datasets and deep learning, supervised segmentation models have improved significantly and can predict semantic categories; they have reached human-level segmentation performance on several public datasets, but they require large numbers of pixel-level annotations and do not generalize to unknown categories or data distributions. This thesis therefore moves toward more realistic image segmentation scenarios: insufficient data annotations, unknown test categories, and unknown test-time data distributions. The main contributions are summarized as follows.

(1) We present Deeply Unsupervised Patch Re-ID (DUPR), a simple yet effective method for unsupervised visual representation learning tailored to image segmentation. To bridge the gap between image-level unsupervised pre-training and downstream segmentation tasks, DUPR extends unsupervised contrastive learning to the local level: corresponding local regions (i.e., patches) in two views of an image are treated as positive pairs, and patches from other images as negative pairs, so that the correspondence between the two views is learned with a contrastive loss. The pre-trained model produces discriminative local features for downstream segmentation tasks. Moreover, since segmentation models typically make multi-scale predictions, the unsupervised loss is applied to multi-scale feature maps to better match the structure of segmentation models. Extensive experiments demonstrate that DUPR outperforms state-of-the-art unsupervised pre-training, and even supervised pre-training, on a variety of downstream tasks related to image segmentation.

(2) We propose a decoupling formulation for zero-shot semantic segmentation, which splits the problem into two subtasks: 1) a category-agnostic grouping task that clusters pixels into mask regions, and 2) a region-level zero-shot classification task. The former involves no category information and can directly group pixels of unseen categories into category-agnostic mask regions; the latter performs region-level classification, which can better exploit large-scale vision-language pre-trained models such as CLIP. Based on this formulation, the thesis presents a simple yet effective transformer-based zero-shot semantic segmentation model, named ZegFormer, which significantly outperforms previous methods on zero-shot semantic segmentation benchmarks. Because previous benchmarks cover a limited number of categories (at most 171), the thesis additionally proposes a new benchmark based on ADE20K-Full with 847 categories; on this benchmark, ZegFormer's performance is close to that of supervised segmentation models.
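As a concrete illustration of the decoupled formulation, the following is a minimal sketch of the region-level zero-shot classification step. It assumes hypothetical inputs: class-agnostic masks from a grouping model, a dense feature map from a CLIP-like image encoder, and L2-normalized text embeddings of candidate category names. The function name, tensor layout, and temperature value are illustrative assumptions, not ZegFormer's actual implementation.

```python
import torch
import torch.nn.functional as F

def region_zero_shot_classify(feat_map, masks, text_emb, temperature=0.07):
    """Score class-agnostic mask regions against category-name embeddings.

    feat_map: (C, H, W) dense visual features (e.g., from a CLIP-like encoder).
    masks:    (N, H, W) soft masks from a category-agnostic grouping model.
    text_emb: (K, C) L2-normalized text embeddings of K category names.
    Returns:  (N, K) per-region category probabilities.
    """
    # Mask-average pooling: one feature vector per region.
    weights = masks.flatten(1)                                  # (N, H*W)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
    region_feat = weights @ feat_map.flatten(1).t()             # (N, C)
    region_feat = F.normalize(region_feat, dim=1)
    # Cosine similarity to text embeddings, softmax over categories.
    logits = region_feat @ text_emb.t() / temperature           # (N, K)
    return logits.softmax(dim=1)
```

Because the grouping step never uses category labels, the same masks can be scored against an arbitrary set of category names at test time, which is what allows pixels of unseen categories to be segmented.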
(3) This thesis proposes a novel hierarchical grouping transformer (HGFormer) that explicitly groups pixels into part-level masks and then into whole-level masks. Compared with per-pixel classification models, HGFormer adopts mask classification and thus obtains more robust classification predictions; compared with flat-grouping mask classification models, it predicts more reliable masks. Masks at each scale can be classified to produce semantic segmentation results, and the classification results at the two scales are then combined for more robust semantic segmentation (a sketch of this two-scale fusion is given at the end of this summary). Using seven public semantic segmentation datasets, the thesis constructs multiple cross-dataset semantic segmentation settings. The experimental results show that HGFormer produces semantic segmentation results that are more robust than those of per-pixel classification models and flat-grouping mask classification models, significantly outperforming previous methods.

In summary, this thesis addresses the limitations of deep-learning-based image segmentation models through unsupervised pre-training, zero-shot semantic segmentation, and domain-generalized semantic segmentation, making notable advances in each of these areas. These results substantially reduce the need for manual annotation during training while improving the generalization of segmentation models to unknown categories and data distributions. The research presented here takes a small but meaningful step toward bringing image segmentation closer to real-world applications, and the proposed models and algorithms offer useful insights for future algorithmic research on other computer vision tasks.
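To make the two-scale fusion in contribution (3) concrete, below is a minimal sketch of how mask classification output at two grouping scales can be converted into per-pixel semantic scores and combined. The tensor shapes, the sigmoid/softmax combination, and the fusion weight alpha are illustrative assumptions rather than HGFormer's exact inference procedure.

```python
import torch

def masks_to_semseg(mask_logits, class_logits):
    """Convert mask classification output into per-pixel category scores.

    mask_logits:  (N, H, W) predicted masks at one grouping scale.
    class_logits: (N, K)    category logits for each mask.
    Returns:      (K, H, W) per-pixel semantic scores.
    """
    mask_prob = mask_logits.sigmoid()           # (N, H, W)
    class_prob = class_logits.softmax(dim=1)    # (N, K)
    # Each pixel accumulates the class scores of the masks covering it.
    return torch.einsum("nhw,nk->khw", mask_prob, class_prob)

def fuse_two_scales(part_masks, part_cls, whole_masks, whole_cls, alpha=0.5):
    """Average part- and whole-level semantic scores (alpha is hypothetical)."""
    sem = alpha * masks_to_semseg(part_masks, part_cls) \
        + (1.0 - alpha) * masks_to_semseg(whole_masks, whole_cls)
    return sem.argmax(dim=0)                    # (H, W) predicted category map
```

Averaging the two scales reflects the intuition stated above: reliable part-level masks can compensate for errors in whole-level grouping, and vice versa, yielding more robust cross-dataset segmentation.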