Research On Object Detection Based On Vision Transformer

Posted on: 2024-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Yao
Full Text: PDF
GTID: 2568307067473254
Subject: Computer technology
Abstract/Summary:
Object detection is a fundamental problem in computer vision that aims to locate and classify objects in images. Convolutional Neural Networks (CNNs) have long dominated computer vision tasks such as classification, object detection, and instance segmentation, but in recent years the Vision Transformer (ViT) has shown encouraging performance across a range of computer vision tasks. The Transformer modules in ViT rely on the self-attention mechanism to capture long-range dependencies between patches (i.e., the global context) and, benefiting from their shape bias, are able to focus on the important parts of an image. However, the self-attention mechanism may ignore the structural information and local relations within each patch, and its computational complexity grows with the square of the input image size. These are the weaknesses of ViT. CNNs, by contrast, have a limited receptive field that makes it difficult to capture global dependencies, but convolution over a small neighborhood effectively reduces local redundancy, and local connectivity with translation invariance means that every patch in the image is processed with the same weights. This inductive bias makes CNNs depend more strongly on texture than on shape when classifying visual objects. This thesis therefore proposes improvements to the ViT-based backbone network and the FPN-based neck structure of the object detection model in order to improve the performance of ViT-based object detection. The contributions are summarized as follows:

(1) A local enhancement module composed of multiple groups of convolutions and activation functions is proposed to compensate for the incomplete information that ViTs capture when extracting features, so that the convolution and Transformer modules complement each other. Second, a channel attention module is introduced to capture the channel information generated by the frequent operations on channels during self-attention computation. Third, pooling layers replace some of the Transformer blocks that use self-attention modules, which reduces the high computational cost of part of the self-attention computation. Furthermore, a single Transformer module lacks cross-window correlation, so cross-window connections are introduced while keeping the efficient computation of non-overlapping windows, and an alternating strategy is devised that alternates the three configurations in consecutive Transformer blocks. Finally, convolutional position encoding is used, which avoids both the limited portability and reusability of absolute position encoding and the poor performance of relative position encoding caused by modifications to self-attention. Experiments show that, without pretraining, the proposed backbone network achieves 40.3 box AP and 37.1 mask AP on the COCO object detection dataset. Under similar FLOPs and parameter settings, this exceeds ResNet-50 by 10.0 box mAP and 7.0 mask mAP, and CSWin-T by 1.8 box mAP and 1.2 mask mAP.

(2) An FPN-based neck structure, AFPN, is designed to work with the new backbone network. It uses semantic information from multi-level feature maps to enrich the high-resolution and local attention features, further increasing the applicability of the model in different scenarios. The experiments use the same evaluation metrics and training settings as for the backbone network. Comparing combinations of different backbone networks with AFPN and with FPN shows that AFPN yields a certain degree of improvement over FPN.
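
The quadratic cost of global self-attention noted in the abstract, and the saving obtained by restricting attention to non-overlapping windows, can be illustrated with simple arithmetic. The sketch below is only an illustration and not the thesis's implementation: the patch size of 16 and the 7x7-token window (49 tokens) are assumed values, and the function names are hypothetical.

```python
# Illustrative arithmetic only; patch size and window size are assumptions,
# not values taken from the thesis.

def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


def global_attention_pairs(n_tokens):
    """Token pairs scored by global self-attention (every token attends to every token)."""
    return n_tokens * n_tokens


def window_attention_pairs(n_tokens, window_tokens):
    """Token pairs scored when attention is restricted to non-overlapping windows."""
    n_windows = n_tokens // window_tokens
    return n_windows * window_tokens * window_tokens


if __name__ == "__main__":
    for side in (224, 448, 896):
        n = num_patches(side, side, patch=16)        # assumed 16x16 patches
        g = global_attention_pairs(n)
        w = window_attention_pairs(n, window_tokens=49)  # assumed 7x7 windows
        print(f"{side}x{side}: tokens={n}, global pairs={g}, windowed pairs={w}")
```

Under these assumptions, doubling the image side from 224 to 448 quadruples the token count (196 to 784), multiplies the global attention pairs by 16 (38416 to 614656), and multiplies the windowed pairs by only 4 (9604 to 38416). Windowed attention alone does not relate tokens in different windows, which is the gap the cross-window connections described above are meant to close.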
Keywords/Search Tags: Object Detection, Vision Transformers, Locally Enhanced, Channel Attention, Feature Pyramid Network