| With the continued advance of deep learning in artificial intelligence, fully convolutional neural networks have made significant progress in instance segmentation and its sub-fields, but many problems remain open. In referring image segmentation, a text expression must be fed into the network together with the image, because the text specifies the entity to be segmented. The task therefore has to combine features from two different modalities, and it places stricter demands on methods that match text information to image information. At present, the two main lines of improvement are new approaches to multi-level, multi-scale, and multi-feature fusion, and exploration of how text information can guide the task. Building on previous work, this paper explores the guiding role of text and proposes using the nouns in the text to guide the generation of local image features, which adds detail to the model and helps refine the segmentation mask, thereby improving accuracy. The work consists of three parts: (1) To address the accurate alignment of visual and linguistic features, this paper uses an attention-based progressive understanding module that feeds in text features step by step, based on the network's adaptive analysis of the parts of speech in the text. The attention mechanism progressively understands and focuses on the entity referred to in the text, gradually guiding the model to segment that entity. (2) To address the lack of local features in referring image segmentation, this paper analyzes the causes and proposes noun-guided local feature extraction: a local feature generation module uses nouns obtained in advance to guide the network to adaptively extract local features, adding detail to the network, and its effectiveness is demonstrated in the relevant experiments. (3) To address the insufficiently detailed edges of the predicted mask, this paper builds on the local features generated by that module and uses the local feature weights to construct a decoding correction module, which enriches edge information with local features and low-level semantic features so that the output mask is segmented more precisely. Extensive experiments were carried out on three datasets, RefCOCO, RefCOCO+, and G-Ref, where the method achieves segmentation accuracies of 66.72%, 56.62%, and 60.39%, respectively, on the validation sets. Compared with other advanced methods, performance improves on all three datasets, and the improvement is most pronounced on G-Ref, whose expressions are longer and more complex, which reflects the effectiveness of the proposed model. |
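
The idea of letting a noun guide local feature extraction can be illustrated with a minimal sketch. It is not the module proposed in this paper: it assumes, purely for illustration, that guidance takes the form of scaled dot-product attention between one noun embedding and per-pixel image features. The function name `noun_guided_local_features` and the toy inputs are hypothetical.

```python
import math


def noun_guided_local_features(pixel_feats, noun_vec):
    """Reweight per-pixel features by their softmax similarity to a noun embedding.

    Assumption: attention-style guidance; the paper's actual module may differ.
    """
    dim = len(noun_vec)
    # Scaled dot-product score between each pixel feature and the noun embedding.
    scores = [
        sum(p * n for p, n in zip(feat, noun_vec)) / math.sqrt(dim)
        for feat in pixel_feats
    ]
    # Numerically stable softmax over spatial positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Local features: each pixel feature scaled by its attention weight.
    local = [[w * x for x in feat] for w, feat in zip(weights, pixel_feats)]
    return weights, local


# Toy example: three "pixels" with 2-dimensional features and one noun embedding.
pixel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
noun_vec = [1.0, 0.0]
weights, local = noun_guided_local_features(pixel_feats, noun_vec)
```

In this toy input, the first and third pixels align with the noun embedding and receive equal, larger weights than the second, so their features dominate the resulting local feature map.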