| Visual object tracking is an important research field in computer vision. This dissertation studies single-object visual tracking. The task is to estimate the trajectory of a target through a video sequence, given its initialization in the first frame. Besides being a challenging direction in artificial intelligence, object tracking is closely related to practical applications such as video surveillance, vehicle navigation, motion analysis, and automatic robot navigation. By analyzing challenging problems in this area, such as the effect of spatial-temporal relationships in a video sequence on accuracy, scale estimation for severely deformed targets, and the influence of pixel position information in feature maps on target localization, this dissertation explores algorithms and methods that improve tracking performance in the following respects. In a video sequence, the target's appearance changes with a certain temporal continuity, and the weights of different positions in the spatial region also change continuously, so the sequence shows clear spatial-temporal continuity. Existing discriminative correlation filter based trackers mainly learn the appearance model by optimizing over either the spatial or the temporal dimension. To exploit both spatial and temporal continuity, we propose an adaptive spatial-temporal regularized correlation filter method that improves the discriminative ability of the tracker in both dimensions. In addition, we combine multiple deep features with a two-dimensional scale search method to improve the accuracy of scale estimation. We evaluate our approach on four public tracking benchmarks. For example, we obtain an AUC score of 70.0% and a precision of 93.6% on the OTB-100 dataset, and an AUC score of 0.401 and a precision of 0.399 on the LaSOT dataset, illustrating the effectiveness of our approach. In the high-dimensional features of deep
convolutional networks, different spatial locations and different channels contribute differently to the target representation. To enhance the representation ability of the model, we propose a dual-attention Siamese network for visual object tracking. Furthermore, to adapt to scale changes of the target, we employ anchor-free bounding box regression to predict the target scale. We conduct comprehensive experiments on four public tracking benchmarks. The experimental results show that our approach achieves EAO scores of 0.551 and 0.443 and Ro scores of 0.112 and 0.187 on the VOT2016 and VOT2018 datasets, respectively. Furthermore, to address the tracking failures that most trackers suffer on small targets, we propose enlarging the search image patch to improve robustness, and we illustrate the effectiveness of this method on VOT2019. Unlike earlier tracking datasets, which are annotated with axis-aligned bounding boxes, the objects in the VOT2016 and VOT2018 datasets are annotated with rotated bounding boxes, which approximate the target well, especially for non-rigid objects. To improve the accuracy of scale estimation for non-rigid targets, we propose a two-stage deep tracker that combines object detection and object segmentation. We evaluate it on four popular tracking benchmarks. Compared with other methods, our approach achieves the best overall performance on the VOT datasets. For example, our method achieves the best EAO scores of 0.577, 0.514, and 0.390 on the VOT2016, VOT2018, and VOT2019 datasets, and obtains lower failure rates of 0.098, 0.130, and 0.276 on these datasets. Most deep Siamese network based trackers use cross-correlation for target localization. However, a disadvantage of convolution, through which the cross-correlation is computed, is that its sampling region is fixed by the size of the kernel. To explore the global relationships between pixels in the image, we propose a Transformer and deep Siamese network based
tracking algorithm. The original self-attention module uses absolute positional embeddings. Because the weights of spatial positions differ between the target region and the background, we explore combining absolute and relative positional information for similarity learning. In addition, because a traditional MLP cannot preserve spatial structure information, we design the classification and regression branches with large convolutional kernels, which benefits the similarity comparison between the template and search patches. The experimental results show that, when both kinds of positional information are combined and large convolutional kernels are used, our model obtains an AO score of 64.0%, improving tracking performance by a large margin. Finally, this dissertation summarizes the tracking algorithms and gives an outlook on future research. |