Research on Spatio-temporal Attention Mechanisms for Weakly Supervised Object Detection and Segmentation

Posted on: 2022-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q En
Full Text: PDF
GTID: 1488306764494144
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, images and videos have become essential forms of digital media. Extracting adequate semantic information from them has become a hot research topic in computer vision. Most current deep learning-based methods are trained on data with many explicit task labels. However, specific and specialized scenarios lack the task-relevant, fine-grained labeled data needed to learn high-quality visual information. This dissertation focuses on the incomplete and inexact supervision problems within weakly supervised learning, i.e., training data in which only a portion of the samples are labeled and training data with only coarse-grained labels. Two practical cases usually arise: (1) images and videos have only category labels; (2) images and videos have only offline labels. In these two difficult scenarios, models often struggle to establish a direct link between the data and the weak labels, and they cannot adapt to changes in the appearance of online targets. Inspired by the human visual attention mechanism, this dissertation studies how to combine spatio-temporal attention mechanisms effectively with weakly supervised information in order to enhance a model's object detection and segmentation capability in weakly supervised scenarios.

First, a human-like delicate region erasing strategy for weakly supervised detection and localization is proposed for the scenario with only image category labels. Existing deep learning algorithms iteratively select targets from many candidate boxes, which produces a large amount of redundant computation and does not conform to the human selective visual attention mechanism. Therefore, this part constructs a human-like attention mechanism with a reinforcement learning algorithm to learn the implicit relationship among the input data, the weakly supervised labels, and the target. Starting from the feature maps generated by a neural network trained on the weakly labeled data, and from the contribution of salient target regions to the classification confidence, the method attends iteratively to salient object regions and selects the most salient regions with a high contribution to the classification confidence as the visual attention regions. The method proposed in this part effectively mimics the human visual mechanism. Experimental results on two publicly available datasets show that it achieves results comparable to those of other deep learning methods while significantly improving detection efficiency.

Second, a weakly supervised video object segmentation method based on multi-source saliency and an exemplar mechanism is proposed for the scenario with only video category labels. Existing methods fail to combine spatio-temporal information in a learned way to generate high-quality pseudo labels, and they fail to use category labels effectively. Therefore, this part trains weakly supervised segmentation networks by learning spatio-temporal salient regions and exemplar sample adaptation from multiple sources of saliency knowledge. A multi-source saliency module and a spatio-temporal exemplar adaptation module are constructed. They extract the typical relationship between the temporal and spatial domains using visual saliency priors, and they use deep neural networks to make full use of the category information, combined with collaborative semantics, for the segmentation task. Experimental results on three public video object segmentation datasets show that the proposed method mimics the fusion process of human multi-source cognition and effectively improves the generalization performance and segmentation accuracy of the algorithm.

Third, an unsupervised video object segmentation method based on a local-global memory mechanism is proposed for the scenario in which videos have only offline mask annotations. Most existing methods rely on optical flow or recurrent neural networks, but the quality of optical flow cannot be guaranteed in complex scenarios, and recurrent neural networks are difficult to optimize. Therefore, this part considers local and global memory mechanisms together to obtain reliable short-term and long-term inter-frame correlation information at the same time, which enables unsupervised video object segmentation. The global and local memory modules accomplish the unsupervised video segmentation task in a macro-to-micro paradigm through a collaborative mutual attention mechanism and a graph convolutional network. Results on three public video object segmentation datasets show that the proposed local-global memory mechanism effectively improves the segmentation accuracy of the algorithm.

Finally, an object segmentation method based on a mask-guided self-feedback mechanism is proposed for the scenario in which images and videos have only offline mask annotations. Existing methods extract spatial semantic information by designing different feedforward network connections. However, due to the limitations of the feedforward network, the intermediate-layer features are not well optimized in the task-driven direction, which leads to the problem of pre-context confusion. Therefore, this part proposes a mask-guided feedback neural network, inspired by visual feedback mechanisms, that jointly optimizes the focusing network and the feedback network through a focusing, feedback, and re-estimation process. The network generates global target features from the mask information produced by the high-level network, and then produces re-estimated segmentation results by propagating these features against the intermediate features of the original image. Applied to unsupervised video segmentation, video saliency detection, image saliency detection, and image semantic segmentation, the mask-guided feedback neural network effectively improves accuracy over the benchmarks and achieves the best results to date on each task.
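The focusing, feedback, and re-estimation idea in the final contribution can be illustrated with a minimal sketch. This is not the dissertation's network: it assumes that the "global target feature" is a mask-weighted average of intermediate features and that "feature propagation" is a cosine-similarity comparison followed by a fixed threshold. The function names, the threshold of 0.5, and the toy feature values are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def masked_average_pool(features, mask):
    """Average the per-pixel feature vectors, weighted by a soft mask.

    features: H x W x C nested lists (assumed intermediate features)
    mask:     H x W nested lists with values in [0, 1] (assumed coarse mask)
    Returns a C-dimensional "global target feature".
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    weight = 0.0
    for feat_row, mask_row in zip(features, mask):
        for feat, m in zip(feat_row, mask_row):
            weight += m
            for c in range(channels):
                total[c] += m * feat[c]
    if weight == 0.0:
        return total  # empty mask: no target evidence to pool
    return [t / weight for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def feedback_reestimate(features, coarse_mask, threshold=0.5):
    """Re-estimate a binary mask by comparing every pixel feature with the
    global target feature pooled from the coarse mask (hypothetical rule)."""
    target = masked_average_pool(features, coarse_mask)
    return [
        [1.0 if cosine(feat, target) > threshold else 0.0 for feat in row]
        for row in features
    ]


# Toy 2 x 2 feature map with 2 channels; the coarse mask finds one pixel only.
features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.1]],
]
coarse_mask = [
    [1.0, 0.0],
    [0.0, 0.0],
]
refined_mask = feedback_reestimate(features, coarse_mask)
```

In this toy input the coarse mask covers a single pixel, and the re-estimated mask also includes the two other pixels whose features point in nearly the same direction. A learned network would replace both the pooling rule and the fixed threshold.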
Keywords/Search Tags:computer vision, weakly supervised learning, video object segmentation, memory mechanism, feedback mechanism