Semantic segmentation is an important recognition task in computer vision; its goal is to assign a category to every pixel in an image. Convolutional neural networks have recently performed well on this task, but training deep convolutional networks requires large amounts of data with pixel-level annotations. Because pixel-level annotation requires manually labeling every pixel in an image, segmentation datasets are costly and difficult to obtain, and this data cost often hampers progress in image semantic segmentation. To address this, researchers have proposed weakly supervised semantic segmentation, which trains models with weak labels such as image-level labels and bounding boxes. Since weak annotations do not require labeling each pixel precisely, they are cheaper and easier to collect. However, weakly supervised semantic segmentation faces its own challenges. Weak labels are incomplete and usually lack the precise object-region information needed for segmentation, which makes that information hard for a model to learn. In addition, the dimensionality of weak labels is much smaller than that of full segmentation labels, so to match the model output to the weak labels, most current weakly supervised segmentation models are built on image classification models. Although such models can be trained under weak supervision, they tend to attend to only a small, discriminative part of each semantic object, and the resulting segmentation maps over-converge on local regions.

For this reason, this paper studies a high-precision weakly supervised semantic segmentation method. The work faces two main challenges: 1) model accuracy is limited by the lack of image information, so the first challenge is how to effectively supplement the information the model needs; 2) the second challenge is how to design an effective feature-propagation mechanism that prevents over-convergence and forces the model to attend to the whole region of each semantic object.

To address these challenges, we propose a weakly supervised semantic segmentation method based on multimodal fusion, referred to as MFWS. The model consists of two components: a generation model and a refinement model, for which Contrastive Language-Image Pre-training (CLIP) and a convolutional neural network are used, respectively. The generation model fuses image and text information to produce high-quality pseudo segmentation labels, and the refinement model then corrects details on top of these pseudo-labels to improve the accuracy of the segmentation results. Specifically, building on existing multimodal pre-training techniques, we construct a related-word bag by retrieving words related to each class label, which forces the pre-trained model to attend to all parts of an object.
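The text above does not specify how the related-word bag is built or consumed; the following is a minimal sketch only, assuming the OpenAI CLIP package, a ViT-B/32 backbone, and a hand-written `related_words` dictionary (all of which are illustrative assumptions rather than details from the paper). It pools CLIP text embeddings over each class's related words so that the class prompt covers object parts rather than only the single most discriminative word.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical related-word bag: for each dataset class, words naming object
# parts and appearances, so the text prompt covers the whole object rather
# than its most discriminative part. The entries are illustrative.
related_words = {
    "dog": ["dog", "dog head", "dog legs", "dog tail", "furry dog body"],
    "car": ["car", "car wheel", "car door", "car windshield", "car body"],
}

@torch.no_grad()
def class_text_embedding(class_name: str) -> torch.Tensor:
    """Average CLIP text embeddings over the class's related-word bag."""
    prompts = [f"a photo of a {w}" for w in related_words[class_name]]
    tokens = clip.tokenize(prompts).to(device)
    text_feat = model.encode_text(tokens)                    # (num_words, dim)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return text_feat.mean(dim=0)                             # pooled class embedding
```

The pooled embedding can then be matched against CLIP image features to score image regions for each class; the exact matching and pseudo-label generation steps are not detailed here.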
Within the refinement model, we further design a feature refinement module that uses low-level and high-level image correlation matrices as label-propagation matrices over the feature map, thereby suppressing the local over-convergence of the model. We conduct extensive experiments on the PASCAL VOC dataset, and the results demonstrate the effectiveness and superiority of the proposed method.
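The feature refinement module is described only at a high level above. The sketch below shows one plausible reading of correlation-matrix label propagation: coarse class score maps are propagated through row-normalized affinity matrices built from low-level and high-level features. The function name `refine_with_affinity`, the tensor shapes, and the alternating iteration scheme are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def refine_with_affinity(cam: torch.Tensor, low_feat: torch.Tensor,
                         high_feat: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Propagate coarse class scores (cam: C x H x W) across pixels using
    affinity matrices built from low-level and high-level features
    (each: D x H x W). Illustrative sketch; the paper's formulation may differ."""
    C, H, W = cam.shape
    cam_flat = cam.reshape(C, H * W)                          # (C, N)

    def affinity(feat: torch.Tensor) -> torch.Tensor:
        f = feat.reshape(feat.shape[0], H * W)                # (D, N)
        f = F.normalize(f, dim=0)                             # unit-norm per pixel
        A = (f.t() @ f).clamp(min=0)                          # (N, N) cosine affinity
        return A / A.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize

    A_low, A_high = affinity(low_feat), affinity(high_feat)
    for _ in range(n_iters):
        # alternate propagation through low-level and high-level affinities
        cam_flat = cam_flat @ A_low.t()
        cam_flat = cam_flat @ A_high.t()
    return cam_flat.reshape(C, H, W)
```

Row-normalizing the affinity matrix makes each propagation step a weighted average over similar pixels, which spreads class scores from the small discriminative regions a classification backbone activates toward the rest of the object.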