Transformers are widely used for object detection. However, compared with CNN-based detectors, Transformer-based detection models generally suffer from slow convergence. To address this problem, this paper designs model-transformation experiments and concludes from their analysis that the main factors affecting the convergence speed of Transformer detection models are region-specific sparse sampling, spatial priors, and multi-scale input. The Transformer detection model is then improved along each of these three factors. First, this paper integrates region-specific sparse sampling and relative position encoding into a lightweight redesign of the Transformer attention mechanism and, combined with spatial-prior prediction, proposes a Transformer detection model based on pre-filtered attention. Second, building on the proposed model, its feature extraction, Transformer module, and prediction network are further extended to multiple scales, which greatly shortens the convergence time of the Transformer detection model and also improves detection accuracy. The main research work and results of this paper are as follows:

(1) By comparing CNN and Transformer detection models, we found that Transformer-based detectors converge slowly. To analyze this problem, this paper designs model-transformation experiments, which identify region-specific sparse sampling, spatial priors, and multi-scale input as the main factors affecting the convergence speed of Transformer detection models. These findings provide the theoretical basis and improvement directions for this paper, and the subsequent work optimizes the Transformer detection model around these three factors.

(2) To address the slow convergence of Transformer detection models, this paper proposes an object detector based on a Transformer with pre-filtered attention, built on the ideas of region-specific sparse sampling (sketched at the end of this abstract) and spatial priors. The model replaces the original Transformer's way of processing images with a lightweight attention module, reducing computation and saving training time. A directed relative position encoding is also proposed to compensate for the relative position information lost in the attention computation. Furthermore, the model regresses bounding boxes as relative offsets to reduce learning difficulty. Experiments on the COCO dataset show that these improvements successfully accelerate convergence and relieve the burden of global modeling.

(3) The pre-filtered-attention detector is extended to multiple scales. First, hybrid multi-attention is introduced to construct multi-scale feature inputs that make full use of image features. Second, the pre-filtered attention is extended across scales to perform feature fusion and processing. In addition, a joint regression loss (a generic sketch follows below) is proposed to quickly stabilize bounding-box regression, finally yielding an accurate and efficient detection model. Experiments on the COCO and Cityscapes datasets demonstrate the model's advantages in both convergence speed and accuracy.
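The abstract names but does not define the joint regression loss. The following is a minimal sketch of one common instantiation, a weighted L1 + GIoU combination as popularized by DETR; the function names and the weights w_l1 and w_giou are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, target):
    """1 - GIoU for boxes in (x1, y1, x2, y2) format, both shaped (N, 4)."""
    lt = torch.max(pred[:, :2], target[:, :2])   # intersection top-left
    rb = torch.min(pred[:, 2:], target[:, 2:])   # intersection bottom-right
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)
    # smallest enclosing box; its extra area penalizes non-overlapping pairs
    enclose = (torch.max(pred[:, 2:], target[:, 2:])
               - torch.min(pred[:, :2], target[:, :2])).prod(dim=1)
    giou = iou - (enclose - union) / enclose.clamp(min=1e-7)
    return 1.0 - giou

def joint_regression_loss(pred, target, w_l1=5.0, w_giou=2.0):
    """Illustrative joint loss: L1 stabilizes early training, GIoU aligns boxes."""
    l1 = F.l1_loss(pred, target, reduction="none").sum(dim=1)
    return (w_l1 * l1 + w_giou * giou_loss(pred, target)).mean()

# Usage: build valid (x1, y1, x2, y2) boxes and compute the loss.
xy, wh = torch.rand(8, 2), torch.rand(8, 2)
pred = torch.cat([xy, xy + wh], dim=1)
loss = joint_regression_loss(pred, pred.detach() + 0.05)
```

Combining a scale-sensitive term (L1) with a scale-invariant one (GIoU) is a common way to keep gradients informative both when boxes barely overlap and when they are nearly aligned.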
In summary, to address the slow convergence of Transformer detection models, this paper integrates the ideas of region-specific sparse sampling, spatial priors, and multi-scale input, guided by the conclusions of the model-transformation experiments, and proposes a multi-scale Transformer detection model based on pre-filtered attention. The proposed model resolves the slow convergence caused by the global modeling of the attention mechanism in the original Transformer and improves detection accuracy. Extensive experiments demonstrate the advantages of the studied detection model in both convergence speed and accuracy.
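To make the idea of region-specific sparse sampling in (2) concrete, below is a minimal, deformable-attention-style sketch: each query samples a handful of learned locations around its reference point rather than attending to the whole feature map. The module name SparseSamplingAttention, the tensor shapes, and n_points are illustrative assumptions; this is not the paper's exact pre-filtered attention, which additionally incorporates the directed relative position encoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseSamplingAttention(nn.Module):
    """Each query attends to a few bilinearly sampled points near its
    reference location instead of every position in the feature map."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.offset_proj = nn.Linear(dim, n_points * 2)  # per-query offsets
        self.weight_proj = nn.Linear(dim, n_points)      # per-query sample weights
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2) in [-1, 1]; feat: (B, C, H, W)
        B, Nq, _ = queries.shape
        value = self.value_proj(feat)
        offsets = self.offset_proj(queries).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)          # (B, Nq, P)
        grid = (ref_points.unsqueeze(2) + offsets).clamp(-1.0, 1.0)  # (B, Nq, P, 2)
        sampled = F.grid_sample(value, grid, align_corners=False)    # (B, C, Nq, P)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)           # (B, C, Nq)
        return self.out_proj(out.transpose(1, 2))                    # (B, Nq, C)

# Usage: 100 queries sparsely sample a 32x32 feature map.
attn = SparseSamplingAttention(dim=256)
q = torch.randn(2, 100, 256)
ref = torch.rand(2, 100, 2) * 2 - 1
out = attn(q, ref, torch.randn(2, 256, 32, 32))  # -> (2, 100, 256)
```

Because each query touches only n_points locations, the per-query cost drops from O(HW) to O(P), which is the kind of saving the abstract credits with reducing computation and shortening training time.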