A Method For Multi-target Automatic Video Object Segmentation

Posted on: 2023-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Sha
Full Text: PDF
GTID: 2568306827967519
Subject: Information and Communication Engineering
Abstract/Summary:
Automatic video object segmentation (AVOS) has played an increasingly important role in recent years. It applies both to tasks that require specified object categories, such as video conferencing and autonomous driving, and to general video understanding of arbitrary objects. However, the task faces great challenges in instance segmentation and temporal continuity because of complex backgrounds and the changing appearance of objects. Existing AVOS methods share common problems, such as semantic ambiguity between similar objects and missed objects in complex scenes. To address these problems, this thesis improves spatio-temporal fusion at the semantic level.

First, the work proposes an efficient end-to-end multi-target AVOS model: a flexibly learned positioning-and-modification model based on two spatio-temporal branches. The model builds on the idea of centroid location in SOLOv2 and aims to obtain the centroid position of each independent object more accurately. Because videos involve spatio-temporal interactions, the network is designed so that one branch concentrates on temporal matching and the other on mining features within a frame. Using only video frames as input, the two-branch network separates the learning of appearance features within a single frame from the learning of motion information between frames, and provides a more flexible input to the subsequent stages: spatio-temporal fusion, category prediction, and segmentation prediction. After the spatio-temporal context is merged, a semantic optimization module alleviates the accumulated error and semantic overlap produced in earlier stages, so that individual objects are located accurately, which in turn leads to more accurate segmentation masks.

In addition, the model replaces convolution layers with a Transformer module as the attention mechanism of the network, together with an extra input called a global embedding, which progressively improves self-attention and inter-attention as the features of frames are fused. The Transformer-based two-branch model is shown to adapt to occlusion, and a comparison between the two networks is presented. Furthermore, a supervised ID embedding prediction is added to the model to improve the temporal continuity of the AVOS task. A number of ablation experiments support the method. Compared with existing methods, this work shows competitive performance on the J and F metrics for video object segmentation while maintaining real-time running speed.
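Two ideas the abstract relies on can be illustrated with a minimal sketch, which is not the thesis's implementation. It assumes that centroid location follows the SOLO-style convention of assigning an object to the cell of an S x S grid that contains the mask's center of mass, and that J is the Jaccard index (intersection over union) of predicted and ground-truth masks, as is conventional in video object segmentation benchmarks. The function names `box_mask`, `centroid_grid_cell`, and `jaccard` are hypothetical, and masks are plain nested lists of 0/1 to keep the example dependency-free.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: mask[y][x] is 0 or 1


def box_mask(h: int, w: int, y0: int, y1: int, x0: int, x1: int) -> Mask:
    """Hypothetical helper: an h x w mask with ones in rows [y0, y1) and cols [x0, x1)."""
    return [[1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(w)]
            for y in range(h)]


def centroid_grid_cell(mask: Mask, grid_size: int) -> Tuple[int, int]:
    """Return the (row, col) cell of a grid_size x grid_size grid that
    contains the center of mass of the mask (assumed SOLO-style assignment)."""
    h, w = len(mask), len(mask[0])
    points = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    if not points:
        raise ValueError("mask is empty; no centroid to locate")
    cy = sum(y for y, _ in points) / len(points)
    cx = sum(x for _, x in points) / len(points)
    # Scale image coordinates to grid coordinates and clamp to the last cell.
    row = min(int(cy / h * grid_size), grid_size - 1)
    col = min(int(cx / w * grid_size), grid_size - 1)
    return row, col


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: |pred AND gt| / |pred OR gt|."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


if __name__ == "__main__":
    pred = box_mask(8, 8, 2, 4, 4, 6)
    gt = box_mask(8, 8, 2, 4, 5, 7)
    print(centroid_grid_cell(pred, grid_size=4))
    print(round(jaccard(pred, gt), 3))
```

The F metric, which scores the agreement of mask boundaries rather than regions, is omitted here because it needs a contour-matching step that this sketch does not attempt.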
Keywords/Search Tags: Automatic Video Segmentation, Multi-target Video Object Segmentation, Semantic Feature Learning, Centroid Location, Mechanism of Attention