| As one of the basic tasks in computer vision, target tracking aims to estimate the position and state of a target in the subsequent frames of a video sequence, given the position of the target to be tracked in that sequence. Because video sequences pose challenges such as occlusion and scale variation, a tracking algorithm must meet high requirements for accuracy as well as inference speed. The Transformer, a network structure applied to computer vision in recent years, offers long-range dependency modeling and parallel computation compared with the convolutional neural network (CNN). It is composed mainly of attention modules, which let the network focus on the feature information of the target and give it good modeling ability and foreground-background discrimination ability. Therefore, building on the theory of the Transformer and of target tracking, this thesis studies Transformer-based visual target tracking algorithms. The main research contents and achievements are as follows: (1) A Transformer tracker based on reconstructed patches is proposed. Transformer trackers with a two-stream network structure apply a pixel-by-pixel attention strategy directly to the features extracted by the backbone, which causes the network to ignore the integrity of the object. To address this problem, the thesis proposes a patch reconstruction method that converts the pixel-by-pixel attention strategy into a windowed attention strategy. The windowed attention strategy preserves object integrity as far as possible, and the receptive field of each patch is four times that of the pixel-by-pixel method. Each patch therefore attends to more comprehensive information, which gives the network better modeling ability. (2) This thesis proposes a tracker for multi-layer 
feature fusion using Transformer. Traditional trackers that use multi-layer feature fusion directly compute a weighted sum of, or concatenate, the features extracted from the backbone network, which leads to a loss of feature information. To solve this problem, we design a multi-layer Transformer fusion network structure that makes full use of the Transformer's long-range dependency modeling and of the low-level and high-level features extracted by the feature extraction network, so that the network has excellent foreground-background discrimination ability. (3) A single-stream pure Transformer tracker is proposed. To ensure high tracking speed, most existing application-level tracking algorithms use a shallow network for feature extraction, which results in an obvious lack of feature information. To solve this problem, we propose a single-stream pure Transformer tracker without any CNN structure. Owing to the parallel computing advantages of the Transformer, tracking accuracy is greatly improved over previous application-level trackers while high tracking speed is maintained. |
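
The difference between the pixel-by-pixel and windowed token layouts in contribution (1) can be illustrated with a minimal sketch. The abstract does not say how patches are reconstructed, so the details here are assumptions: a 2×2 window (inferred only from the stated fourfold receptive field) and plain concatenation of the features inside each window. The names `pixel_tokens` and `window_tokens` are hypothetical and do not come from the thesis.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col] -> C-dimensional feature


def pixel_tokens(fmap: FeatureMap) -> List[Vector]:
    """Pixel-by-pixel layout: every spatial location becomes one token."""
    return [list(vec) for row in fmap for vec in row]


def window_tokens(fmap: FeatureMap, window: int = 2) -> List[Vector]:
    """Windowed layout (assumed): concatenate each window x window block into one token."""
    height, width = len(fmap), len(fmap[0])
    if height % window or width % window:
        raise ValueError("feature map size must be divisible by the window size")
    tokens: List[Vector] = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            token: Vector = []
            for dy in range(window):
                for dx in range(window):
                    token.extend(fmap[top + dy][left + dx])
            tokens.append(token)
    return tokens


# Toy 4x4 feature map with 3 channels per location.
fmap = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]
pixels = pixel_tokens(fmap)
patches = window_tokens(fmap, window=2)
print(len(pixels), len(pixels[0]))    # 16 3
print(len(patches), len(patches[0]))  # 4 12
```

Under these assumptions each patch token covers four spatial locations, so attention runs over a quarter as many tokens, and each token carries the features of a contiguous region of the object.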