Since the 2010s, with the rapid development of Internet technology, the ways in which people communicate have become increasingly diverse, and video has become a vital medium of information transmission. Video conferencing, live streaming, and online classes have made work across many industries more efficient and convenient, while also generating a large amount of video data. With the rise of deep learning, how to make better use of this video data has become an important topic for researchers. Video Object Segmentation (VOS) is an important research direction in computer vision, which aims to track and segment objects of interest from continuous video sequences. This thesis studies semi-supervised VOS: at test time, the ground-truth mask of the first frame specifies the target objects, which are then tracked and segmented in all subsequent frames.

A mainstream semi-supervised VOS approach computes output features based on feature similarity between the current frame and reference (memory) frames. Because similar pixels may exist between different objects, and between foreground and background, it can be difficult for the model to distinguish objects from non-objects. This thesis therefore proposes a VOS method based on a structural Transformer. Following the ideas of image de-averaging and disentangled representation, the feature similarity is divided into a spatio-temporal part and an object-saliency part. The former ensures that differences between pixels are taken into account when computing spatio-temporal relationships, while the latter enables the model to mine the main features inside each object.

In many semi-supervised VOS solutions, the mask is used in one of two ways: encoded into an embedding, or directly downsampled. Both exploit the semantic information of the mask, but the objects' location information in the mask is not fully utilized. This thesis therefore proposes a VOS method based on region strip attention and the Hadamard product, which uses the memory masks to obtain the exact regions of the objects and computes strip pooling over the target region of every memory frame, thereby strengthening the model's attention to object regions and weakening its attention to non-object regions.

This thesis uses only the DAVIS 2017 and YouTube-VOS 2019 training sets for training. Compared with methods that rely on additional static-image datasets and the large video dataset BL30K, this saves considerable storage space as well as time for data collection and curation. The proposed methods achieve strong accuracy: the two methods exceed the baseline model by 1.5% and 1.6% respectively on the DAVIS 2017 validation set, and by 0.5% and 2.3% on the DAVIS 2017 test-dev set; on the YouTube-VOS 2018/2019 validation sets, the first method exceeds the baseline by 0.5% and 0.4%, while the second exceeds it by 1.4% and 1.6%. The two methods also hold an advantage in running speed over many existing methods. Comparative experiments and ablation studies demonstrate the effectiveness and competitiveness of the models from different perspectives.
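The memory-based matching that the above paragraphs build on can be illustrated with a minimal sketch. This is not the thesis's actual model: the function name `memory_readout`, the toy dimensions, and the use of a plain scaled-dot-product softmax are all illustrative assumptions; real VOS models operate on deep feature maps and carry mask information in the value features.

```python
import numpy as np

def memory_readout(query, keys, values):
    """Illustrative sketch (not the thesis's model) of memory matching in
    semi-supervised VOS.

    query:  (Nq, C)  features of the current frame's pixels
    keys:   (Nm, C)  features of memory-frame pixels
    values: (Nm, D)  value features (carrying mask/semantic information)

    Each current-frame pixel attends to all memory pixels via feature
    similarity; the readout is a similarity-weighted sum of the values.
    """
    sim = query @ keys.T / np.sqrt(query.shape[1])   # (Nq, Nm) similarity
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(sim)
    attn = attn / attn.sum(axis=1, keepdims=True)    # softmax over memory pixels
    return attn @ values                             # (Nq, D) readout

# Toy example: 4 query pixels, 6 memory pixels, 8-dim keys, 3-dim values.
rng = np.random.default_rng(0)
out = memory_readout(rng.normal(size=(4, 8)),
                     rng.normal(size=(6, 8)),
                     rng.normal(size=(6, 3)))
print(out.shape)  # (4, 3)
```

The difficulty the thesis targets is visible here: two memory pixels with similar key features receive similar attention weights regardless of which object, or whether foreground or background, they belong to.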
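The idea of restricting strip pooling to the mask region and applying it via a Hadamard product can likewise be sketched. This is a hypothetical toy version under stated assumptions, not the thesis's implementation: the function name `region_strip_attention`, the sigmoid gate, and the masked row/column averaging are illustrative choices only.

```python
import numpy as np

def region_strip_attention(feat, mask):
    """Hypothetical sketch of mask-restricted strip pooling.

    feat: (C, H, W) feature map of a memory frame
    mask: (H, W)    binary object mask of that frame

    Each row (horizontal strip) and column (vertical strip) is averaged
    only over pixels inside the mask; the two strip maps are broadcast
    back to (C, H, W) and combined into a gate that modulates the
    features via a Hadamard product, so non-object regions are suppressed.
    """
    m = mask[None, :, :]                              # (1, H, W)
    eps = 1e-6
    # Masked averages: pool each row along W and each column along H.
    row = (feat * m).sum(axis=2, keepdims=True) / (m.sum(axis=2, keepdims=True) + eps)
    col = (feat * m).sum(axis=1, keepdims=True) / (m.sum(axis=1, keepdims=True) + eps)
    gate = 1.0 / (1.0 + np.exp(-(row + col)))         # sigmoid gate from the strips
    return feat * gate * m                            # Hadamard product; zero outside object

# Toy example: 2 channels, a 4x5 map, object occupying the left two columns.
feat = np.ones((2, 4, 5))
mask = np.zeros((4, 5))
mask[:, :2] = 1.0
out = region_strip_attention(feat, mask)
print(out[:, :, 2:].max())  # 0.0 outside the object region
```

The point of the sketch is the division of labor described above: the memory mask supplies exact object locations, while the strip statistics raise attention inside the object region and zero it out elsewhere.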