| Single-target tracking algorithms based on Siamese networks crop the search-region features in subsequent frames according to the template features of the target selected in the initial frame, then fuse the two sets of features to find the most relevant region as the tracking result. In this thesis, we propose an end-to-end target tracking framework built on a feature fusion network that uses a Transformer instead of correlation operations, and we improve it by introducing temporal information to enhance features and anchor-free corner prediction. The specific contributions are as follows: 1. To address target loss and deformation in long-term tracking, this thesis studies feature fusion based on temporal context information to compensate for the limitations of a single target feature. To this end, an improved Transformer encoder and decoder enhance the features of the template branch and the search-region branch, respectively. Specifically, a dynamically updated template library is established. The template features extracted by ResNet are concatenated and passed to the encoder, where attention propagates information across multiple frames so that the template features enhance one another. In addition, a mask feature encoding prior information about the target's motion is constructed under a Gaussian prior assumption and combined with the search region as the attention weight of the search-region features. It is also combined with the template features to suppress background-region features in the template, and the result is used as input to the decoder on the search-region branch. The search-region features are thus combined with multiple temporally consecutive template features to complete their own enhancement.
2. To address the poor robustness of the three-layer perceptron prediction head to occlusion, background clutter, and similar problems, this thesis studies the probability distribution of corner points based on anchor-free prediction. The position of the bounding box is predicted directly, without preset anchor boxes or a series of complicated post-processing steps. Bounding box regression that directly predicts the center coordinates is equivalent to assuming a single Dirac delta distribution, which cannot model the ambiguity in the dataset. To address this, this thesis proposes a joint regression-classification algorithm that removes the classification branch, jointly represents the classification score and IoU, and learns a more arbitrary and flexible general distribution: a discrete probability distribution over a continuous space that represents the bounding box position. In addition, the Distribution Focal Loss (DFL) is introduced so that the network learns richer information and produces more accurate predictions. Finally, because introducing the attention module reduces real-time performance, a feature sparsification module is proposed to reduce the number of redundant template feature vectors, lower the computational cost, and improve the network's real-time performance. |
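
The idea of representing a box coordinate as a discrete probability distribution, trained with DFL, can be illustrated with a small sketch. It assumes the commonly used Generalized Focal Loss formulation, in which a coordinate is the expectation over integer bins and DFL is a cross-entropy on the two bins that bracket the target. The function names, the bin count, and the pure-Python style are illustrative assumptions, not the thesis's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of bin logits."""
    m = max(logits)
    log_total = math.log(sum(math.exp(v - m) for v in logits))
    return [v - m - log_total for v in logits]


def expected_offset(logits):
    """Decode a coordinate as the expectation over integer bins 0..n-1."""
    return sum(i * math.exp(lp) for i, lp in enumerate(log_softmax(logits)))


def distribution_focal_loss(logits, target):
    """DFL (assumed standard form): weighted cross-entropy on the two bins
    bracketing a continuous target, weighted by proximity to each bin."""
    n = len(logits)
    if n < 2 or not 0 <= target <= n - 1:
        raise ValueError("target must lie within the bin range [0, n-1]")
    left = min(int(math.floor(target)), n - 2)  # keep right bin in range
    right = left + 1
    log_probs = log_softmax(logits)
    return -((right - target) * log_probs[left]
             + (target - left) * log_probs[right])


# Hypothetical example: 4 bins, true offset 1.5 (between bins 1 and 2).
uniform = [0.0, 0.0, 0.0, 0.0]
peaked = [-5.0, 5.0, 5.0, -5.0]
loss_uniform = distribution_focal_loss(uniform, 1.5)
loss_peaked = distribution_focal_loss(peaked, 1.5)
```

Both example distributions decode to the same expected offset, but the one concentrated around the target receives the lower loss. This is how a DFL-style term encourages sharp distributions near the label while still permitting flatter ones when the boundary is ambiguous.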