| Visual object tracking is a fundamental research topic in computer vision, widely used in fields such as robotics, autonomous driving, and human-computer interaction. Its purpose is to predict the state of an arbitrary tracked object in subsequent frames based on the initial target state given in the first frame. Thanks to the efforts of researchers worldwide, visual object tracking has made significant progress and tracking performance has steadily improved. However, due to interference factors such as occlusion, motion blur, geometric deformation, and scale and appearance changes, developing a robust, accurate, and efficient object tracking model remains a highly challenging task. To overcome the shortcomings of existing trackers and further improve the robustness, accuracy, and success rate of target tracking, this paper completes the following work: (1) Most Siamese-network-based object trackers use two independent branches: object classification and bounding box regression. However, because the two branches exchange no information during tracking optimization, the two tasks easily become mismatched and their accuracy inconsistent. This paper proposes a general mutual guidance tracking strategy. By assigning adaptive weights to classification and regression, the two branches complement each other's tracking information, which maintains good classification and localization of the tracked object. The strategy has been applied to several representative trackers to verify its effectiveness and generality. (2) While the Transformer improves tracker performance, it also significantly increases the number of parameters. Transformer-based trackers mostly adopt an encoder-decoder structure to perform feature fusion based on global attention, while the flexibility of the tracking architecture and local fine-grained information are still neglected. To this end, this paper proposes a complementary dual-attention object
tracking architecture, which eliminates the decoder by encoding the concatenated template features and search features with cooperating spatial attention and channel attention. Alternately applying spatial attention and channel attention enables the tracker to focus on both global context information and local fine-grained features. (3) Trackers with a two-stream, multi-level architecture have limited perception of the target and are prone to losing it during long-term tracking. This paper proposes a Transformer-based long-term object-aware tracker, a one-stream tracker with online template updates. By aggregating feature extraction and feature interaction into a unified backbone network, target information at different scales can be attended to broadly by the attention layers. The proposed update strategy also ensures information awareness among the initial template, the updated template, and the search image. The tracker can thus perform extensive information fusion on the target during long-term tracking, computing target correlation while enhancing target features. The backbone model is pre-trained with a self-supervised approach to further improve tracking performance. |
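
The dual-attention idea in contribution (2) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes single-head attention with no learned projections, plain residual additions, and no normalization layers. The function names and the `depth` parameter are hypothetical and chosen only for illustration. Template and search tokens are concatenated, then spatial attention (token-to-token) and channel attention (channel-to-channel) are applied alternately, with no decoder.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum, used here as a residual connection."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


def spatial_attention(x):
    """Token-to-token attention: an N x N weight matrix mixes the N tokens."""
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, transpose(x))                      # N x N
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, x)                             # N x C


def channel_attention(x):
    """Channel-to-channel attention: a C x C weight matrix mixes the C channels."""
    xt = transpose(x)                                     # C x N
    scale = math.sqrt(len(x))
    scores = matmul(xt, x)                                # C x C
    weights = [softmax([s / scale for s in row]) for row in scores]
    return transpose(matmul(weights, xt))                 # back to N x C


def dual_attention_encode(template, search, depth=2):
    """Encode concatenated tokens with alternating spatial/channel attention.

    `depth` (hypothetical) is the number of spatial+channel rounds.
    Returns only the encoded search tokens; no decoder is involved.
    """
    x = template + search                                 # concatenate along the token axis
    for _ in range(depth):
        x = add(x, spatial_attention(x))                  # global context across tokens
        x = add(x, channel_attention(x))                  # per-channel, fine-grained mixing
    return x[len(template):]


# Toy data: 2 template tokens and 3 search tokens, 2 channels each.
template_tokens = [[0.1, 0.2], [0.3, 0.1]]
search_tokens = [[0.2, 0.0], [0.1, 0.4], [0.5, 0.3]]
encoded = dual_attention_encode(template_tokens, search_tokens)
print(len(encoded), len(encoded[0]))
```

In a real tracker the encoded search tokens would be passed to a prediction head; that head, and any learned query/key/value projections, are omitted here.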