| A target tracking algorithm simulates the human visual system: given the initial state of the target in the first frame of a video sequence, it predicts the target state in each subsequent frame. With the advance of the new generation of artificial intelligence technology, target tracking has received extensive attention in video surveillance, human-computer interaction and autonomous driving. Among the available approaches, trackers based on the Siamese network (Siamese-based trackers) offer excellent real-time performance, accuracy and robustness, and have gradually become the mainstream in the field. However, most Siamese-based trackers ignore the coupling between the two parallel branches of the tracker, namely the feature coupling between the foreground-background binary classification branch and the bounding box regression branch, and the structural coupling of the tracking head, which largely limits tracker performance. This thesis therefore proposes the following scheme to decouple Siamese-based trackers. (1) This thesis proposes a decoupled feature extraction network based on attention modules to alleviate the coupling between classification and regression features. Specifically, a Siamese backbone network extracts template features and search region features at different levels. In the last three levels of convolutional residual modules, atrous convolutions with different atrous rates gradually expand the receptive field, and each atrous rate is paired with a correspondingly sized padding so that the spatial resolution of the features stays consistent and local features can be combined with global features. A deformable attention module and a channel attention module are then added to the last three levels of convolutional residual modules for task-related feature learning. The channel attention module models the
channel dependencies of the template features and the search region features separately, yielding rich contextual features with more focused semantic information that are better suited to classification. The deformable attention module preserves the detail of local features and emphasizes the local information of high-resolution features, yielding features better suited to bounding box regression. As a result, the Siamese backbone network provides the tracking head with more robust classification features and fine-grained, information-rich regression features, so that the target state is predicted more robustly and accurately. (2) This thesis studies the difference between the classification branch and the regression branch of Siamese-based trackers, and designs a decoupled tracking head, consisting of a Cls-Head and a Reg-Head, for class estimation and bounding box regression of the target object. The decoupled tracking head uses a different network structure for each head (“Cls-Head” and “Reg-Head”) so that the two parallel branches of the tracker, the classification branch and the regression branch, are processed in a task-dependent manner. We find that Cls-Head is better suited to classification because its confidence score correlates more strongly with robust semantic information, while Reg-Head provides more accurate bounding box regression by locating object boundaries. Compared with traditional tracking head schemes that use the same network structure to extract region of interest (ROI) features for both tasks, our method assigns the two tasks to two different heads to alleviate the misalignment between the classification and regression task domains, making the classification head more robust and the regression head more accurate. Performance evaluation experiments conducted on four widely used tracker performance test
benchmarks (OTB2015, VOT2018, VOT2018 and UAV123), together with qualitative, quantitative and real-time comparative analyses, show that the proposed method runs in real time with excellent robustness and accuracy, which supports the effectiveness of our idea. |
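
The abstract notes that each atrous rate is paired with its own padding so that the spatial resolution of the features stays consistent. The arithmetic behind that statement can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the 3x3 kernel, stride 1, the atrous rates 1, 2 and 4, and the 31-pixel feature map are all assumptions chosen for the example, and the function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def same_padding(kernel=3, dilation=1):
    """Padding that preserves spatial size for stride 1 and an odd kernel."""
    return dilation * (kernel - 1) // 2


if __name__ == "__main__":
    feature_size = 31  # hypothetical feature-map side length
    for rate in (1, 2, 4):  # hypothetical atrous rates, not taken from the thesis
        pad = same_padding(kernel=3, dilation=rate)
        out = conv_output_size(feature_size, kernel=3, padding=pad, dilation=rate)
        print(f"rate={rate} padding={pad} output={out}")
```

For a 3x3 kernel with stride 1, the padding that preserves resolution equals the atrous rate, which is why larger rates need larger padding while the receptive field grows.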