Semi-supervised video object segmentation is a challenging task with broad application prospects in video editing, autonomous driving, and other fields. However, videos often contain complex conditions such as deformation, occlusion, and rapid motion, which limit the segmentation speed and accuracy of existing methods. To address this problem, this thesis proposes a semi-supervised video object segmentation model based on multi-level target appearance information. On this basis, a semi-supervised video object segmentation model based on a spatio-temporal memory network is designed to address insufficient segmentation stability when dealing with local information confusion. The main work and contributions of this thesis are as follows:

(1) To segment target objects in video sequences quickly and accurately, this thesis proposes a semi-supervised video object segmentation method based on Multi-level Target Models and Feature Integration (MTMFI). First, a multi-level target appearance model built from a lightweight convolutional structure enriches the target appearance details while preserving inference speed. In addition, a feature integration module is designed to capture the dynamic changes of the target object across video frames and further improve segmentation accuracy. The model achieves a trade-off between segmentation speed and segmentation accuracy, segmenting target objects accurately while maintaining a high inference speed.

(2) To address the degradation in segmentation performance that most methods suffer when dealing with local information confusion, this thesis proposes a semi-supervised video object segmentation method based on a Spatial-Temporal Memory network with a Top-K filter and ASPP (TA-STM). First, a Top-K filtering mechanism is added to the spatio-temporal memory network to filter global noise and capture the local similarity of the target objects. At the same time, an atrous spatial pyramid pooling (ASPP) module is added to prevent the loss of local information while capturing multi-level appearance information of the target objects. The model ensures segmentation stability, and its segmentation accuracy is not significantly affected by complex factors.

The methods proposed in this thesis have been experimentally verified on the video object segmentation datasets DAVIS-2016, DAVIS-2017, and YouTube-VOS 2018, and extensive experimental comparisons show that both methods achieve competitive results.
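The Top-K filtering idea in contribution (2) can be illustrated with a minimal sketch. This is not the TA-STM implementation: the function name `topk_memory_read`, the dot-product similarity, and the softmax weighting are assumptions made for illustration, since the abstract does not specify them. The sketch reads from a memory for a single query location and keeps only the k best-matching memory entries, so that weakly matching (noisy) entries contribute nothing to the result.

```python
import math


def topk_memory_read(query, mem_keys, mem_values, k):
    """Read from memory using only the k most similar memory entries.

    query: list of d floats (key vector of one query location).
    mem_keys: list of N key vectors, each of length d.
    mem_values: list of N value vectors, each of length c.
    Returns the softmax-weighted sum of the k best-matching values.
    Hypothetical helper for illustration; not taken from the thesis.
    """
    if not 1 <= k <= len(mem_keys):
        raise ValueError("k must be between 1 and the number of memory entries")
    # Dot-product similarity between the query and every memory key (assumed).
    scores = [sum(q * m for q, m in zip(query, key)) for key in mem_keys]
    # Keep the indices of the k highest scores; all other entries get zero weight.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the kept entries, shifted by the max for stability.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Weighted sum of the kept memory values, one output per channel.
    channels = len(mem_values[0])
    return [sum(weights[i] * mem_values[i][c] for i in top) for c in range(channels)]
```

With memory keys `[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]`, values `[[1.0], [100.0], [3.0]]`, and query `[1.0, 0.0]`, setting `k=2` drops the poorly matching second entry, so its outlier value of 100.0 no longer influences the read-out. Setting `k=3` would reproduce an unfiltered softmax read in which that entry still receives some weight.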