| Object tracking aims to lock onto a specific target in a video sequence. As a fundamental task in computer vision, it has attracted increasing attention because of its many potential applications, such as video surveillance, mobile robotics, and autonomous driving. Object trackers fall into two groups according to their goal: passive and active. Most existing work addresses passive tracking, which estimates the bounding box of an object consistently through a sequence of frames using the information given in the initial frame. An active tracker, by contrast, must autonomously control the camera’s motion and pose to keep the target locked. In this thesis, we focus on target template update in passive object tracking and target representation learning in active object tracking. The main contributions of this work are twofold. For passive object tracking, we propose a temporal correlation and channel decorrelation framework that updates the target template in a Siamese network. Siamese-based trackers, the representative passive trackers, initialize the target template in the first frame and keep it fixed; in subsequent frames, this template is convolved with the deep features of the search region for matching. However, a fixed template cannot adapt to changes in target appearance such as scale variation, partial occlusion, and illumination variation. To alleviate these issues, on the one hand, we use the channel-wise correlations between the initial and historical template features to adaptively aggregate informative channel-wise representations for template update. On the other hand, we propose a decorrelation regularization that weakens the channel-wise correlations within each individual template feature. Through end-to-end training, we learn a more complete and adaptive template for accurate object tracking. Extensive experiments on seven benchmark datasets verify the effectiveness of our method. For active object tracking, we propose an end-to-end anti-distractor active object tracking framework for 3D environments. Active trackers take visual observations as input and control the camera’s motion to keep tracking the target. Previous work on active tracking assumes that the environment contains only one object (a person) and no distractors. Moving toward a more realistic setting, we address a more challenging scenario in which the tracker moves freely in 3D space to track a person in various complex scenes with multiple distractors. To this end, on the one hand, we use the target template to learn an embedding that serves as channel-wise attention over the current observation, distinguishing the target from the distractors. On the other hand, we introduce temporal attention to fuse the observation history into a feature representation, which is then fed into a reinforcement learning network that outputs the tracker’s action. To evaluate our method, we build several multi-object 3D environments with Unreal Engine, and extensive experiments demonstrate the effectiveness of our approach. |
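
The decorrelation regularization described for the passive tracker can be illustrated with a small sketch. The abstract does not give the exact loss, so the formulation here is an assumption: a common choice is to penalize the squared off-diagonal entries of the channel correlation matrix of a template feature. The function names `channel_correlation` and `decorrelation_penalty` and the flat-list feature layout are hypothetical, chosen only for illustration.

```python
import math


def channel_correlation(features):
    """Return the C x C Pearson correlation matrix between channels.

    `features` is a list of C channels, each a flat list of spatial
    responses (an assumed layout, not the thesis's tensor format).
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        centered = [v - mean for v in channel]
        # Guard against constant channels, whose centered norm is zero.
        norm = math.sqrt(sum(v * v for v in centered)) or 1.0
        normalized.append([v / norm for v in centered])
    return [
        [sum(a * b for a, b in zip(ci, cj)) for cj in normalized]
        for ci in normalized
    ]


def decorrelation_penalty(features):
    """Mean squared off-diagonal correlation (assumed form of the regularizer).

    The value is 0 when all channels are mutually uncorrelated and
    approaches 1 when every channel is a linear copy of the others.
    """
    corr = channel_correlation(features)
    c = len(corr)
    if c < 2:
        return 0.0
    off_diag = [corr[i][j] ** 2 for i in range(c) for j in range(c) if i != j]
    return sum(off_diag) / len(off_diag)


# Two redundant channels (one is a scaled copy of the other).
redundant = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
# Two channels with zero correlation.
independent = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]

redundant_penalty = decorrelation_penalty(redundant)
independent_penalty = decorrelation_penalty(independent)
```

Under this assumed form, adding the penalty to the tracking loss would push template channels toward carrying complementary rather than redundant information, which is one way to read the "more complete" template the abstract refers to.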