Visual object tracking is a fundamental task in computer vision with a wide range of applications, e.g., autonomous driving, human–computer interaction, and video surveillance. Given the initial target state, a visual tracker is required to locate the target object in successive frames. In recent years, deep-learning-based visual tracking has made significant progress. Nevertheless, building a robust visual tracker is still widely recognized as a challenging task due to numerous complex factors, such as appearance change, occlusion, and motion blur. This work conducts comprehensive research on three aspects: target localization, target size estimation, and feature extraction. The main contributions are as follows:

(1) For target localization, we propose a novel dual-path network with discriminative meta-filters and hierarchical representations. The discriminative meta-filters are trained online with a gradient-descent algorithm to obtain a coarse target location. We then fuse the response maps of multiple filters and exploit hierarchical feature representations to achieve more robust tracking performance (see the filter-fusion sketch below).

(2) For size estimation, we propose a novel target segmentation branch based on a Space-Time Memory network (STM) to handle challenging factors such as occlusion and appearance variation. STM employs a non-local attention mechanism to encode each sample into a (query, key, value) triplet and constructs dense correspondences between the search area and the stored memory samples via attention (see the memory-read sketch below). The STM-based segmentation branch thus exploits temporal information for accurate size estimation, which better captures appearance variations and the fine-grained differences between the target and distractor objects in the video.

(3) Beyond the two subtasks of tracking, i.e., target localization and size estimation, this work also contributes to front-end feature extraction. We make the first attempt to apply a fully attention-based Transformer network to feature extraction in visual tracking. The features extracted by the Transformer are learned from matching and ultimately used for matching, improving tracking performance by a large margin. We also design several specialized modules that reduce the computational FLOPs of the vision Transformer (one standard FLOP-reduction technique is sketched below). The proposed tracker achieves state-of-the-art performance while running in real time.
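
To make contribution (1) concrete, below is a minimal PyTorch sketch of a discriminative filter trained online by gradient descent, followed by a weighted fusion of response maps from hierarchical feature levels. The filter size, plain-SGD optimizer, L2 regression objective, and all tensor shapes are illustrative assumptions, not the exact formulation of the proposed network.

```python
import torch
import torch.nn.functional as F

def online_train_filter(feat, label, num_iters=50, lr=0.1):
    """Optimize a discriminative filter online by gradient descent.

    feat:  (1, C, H, W) backbone features of the training region
    label: (1, 1, H, W) Gaussian label map centered on the target
    The 5x5 filter size and plain SGD are assumptions for illustration.
    """
    C = feat.shape[1]
    filt = torch.zeros(1, C, 5, 5, requires_grad=True)
    opt = torch.optim.SGD([filt], lr=lr)
    for _ in range(num_iters):
        response = F.conv2d(feat, filt, padding=2)   # correlation response map
        loss = F.mse_loss(response, label)           # L2 regression objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return filt.detach()

def fused_response(feats, filters, weights):
    """Fuse response maps produced by filters on hierarchical feature levels."""
    maps = [F.conv2d(f, w, padding=w.shape[-1] // 2) for f, w in zip(feats, filters)]
    # resize all maps to a common resolution before fusing
    maps = [F.interpolate(m, size=maps[0].shape[-2:]) for m in maps]
    return sum(a * m for a, m in zip(weights, maps))  # weighted-sum fusion
```

The coarse target location can then be taken as the arg-max of the fused response map.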
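For contribution (2), the core of an STM-style memory read reduces to a few lines of attention. The sketch below, assuming PyTorch tensors with the listed (hypothetical) shapes, shows how query keys are densely matched against the keys of the stored memory frames and how the corresponding values are read back for the segmentation decoder.

```python
import torch

def stm_read(q_key, q_val, m_key, m_val):
    """Space-Time Memory style read (minimal sketch).

    q_key: (B, Ck, H, W)     key of the search (query) frame
    q_val: (B, Cv, H, W)     value of the search frame
    m_key: (B, Ck, T, H, W)  keys of T stored memory frames
    m_val: (B, Cv, T, H, W)  values of T stored memory frames
    """
    B, Ck, H, W = q_key.shape
    qk = q_key.flatten(2)                        # (B, Ck, HW)
    mk = m_key.flatten(2)                        # (B, Ck, T*HW)
    mv = m_val.flatten(2)                        # (B, Cv, T*HW)
    # dense correspondence: similarity of every query location
    # to every spatio-temporal memory location
    attn = torch.einsum('bck,bcm->bkm', qk, mk) / (Ck ** 0.5)
    attn = attn.softmax(dim=-1)                  # (B, HW, T*HW)
    read = torch.einsum('bkm,bcm->bck', attn, mv).view(B, -1, H, W)
    # concatenate the memory readout with the query value for decoding
    return torch.cat([read, q_val], dim=1)
```

Because the read attends over all stored frames, the branch naturally aggregates temporal information about how the target's appearance has evolved.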
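The summary does not detail the specialized FLOP-reduction modules of contribution (3), so the sketch below only illustrates one standard way to cut attention cost in a vision Transformer: spatially downsampling the key/value tokens (a PVT-style spatial-reduction attention). The class name, head count, and `sr_ratio` are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Attention with downsampled keys/values to reduce FLOPs (illustrative)."""

    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # strided conv shrinks the key/value token grid by sr_ratio per side
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, H*W, dim) token sequence from an H x W feature map
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # fewer key/value tokens
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                # queries keep full resolution
        return out
```

With `sr_ratio=2`, the attention matrix shrinks by a factor of four relative to full self-attention, which is the kind of saving that makes real-time Transformer tracking feasible.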