Research On Object Detection Based On Vision Transformer

Posted on:2024-04-26

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Yao

Full Text:PDF

GTID:2568307067473254

Subject:Computer technology

Abstract/Summary:

Object detection is a fundamental problem in computer vision that aims to locate and classify objects in images. Convolutional Neural Networks (CNNs) have long dominated computer vision tasks such as classification, object detection, and instance segmentation, but in recent years the Vision Transformer (ViT) has shown encouraging performance across a range of vision tasks. The Transformer modules in ViT rely on the self-attention mechanism to capture long-range dependencies between patches (i.e., the global context) and, benefiting from a shape bias, are able to focus on the important parts of an image. However, self-attention may ignore the structural information and local relations within each patch, and its computational complexity is quadratic in the input image size. These are the weaknesses of ViT. A CNN, by contrast, has a limited receptive field that makes global dependencies difficult to capture, but it effectively reduces local redundancy by convolving over a small neighborhood and, through local connectivity and translation invariance, processes every patch in the image with the same weights. This inductive bias leads a CNN to rely more on texture than on shape when classifying visual objects. This thesis therefore proposes improvements to the ViT-based backbone network and the FPN-based neck structure of the object detection model in order to improve the performance of ViT-based object detection. The contributions are summarized as follows:

(1) An improved ViT-based backbone network. First, a local enhancement module composed of multiple sets of convolutions and activation functions is proposed to compensate for the lack of comprehensive information in ViTs during feature extraction, so that the convolution and Transformer modules complement each other. Second, a channel attention module is introduced to capture the channel information generated by the frequent channel operations in the self-attention computation. Third, pooling layers replace the self-attention modules in some Transformer blocks, which reduces the high computational cost that self-attention incurs. Furthermore, a single Transformer module lacks cross-window correlation, so cross-window connections are introduced while keeping the efficient computation of non-overlapping windows. An alternating strategy is then devised that alternates the three configurations in consecutive Transformer blocks. Finally, convolutional position encoding is used, which avoids both the lack of transferability and reusability of absolute position encoding and the poor performance of relative position encoding caused by modifying self-attention. Experiments show that, without pretraining, the proposed backbone network achieves 40.3 box AP and 37.1 mask AP on the COCO object detection dataset. Under similar FLOPs and parameter settings, this exceeds ResNet-50 by 10.0 box mAP and 7.0 mask mAP, and CSWin-T by 1.8 box mAP and 1.2 mask mAP.

(2) An FPN-based neck structure, AFPN. AFPN is designed to work with the newly designed backbone network above. It uses semantic information from multi-level feature maps to enrich high-resolution and local attention features, and further improves the applicability of the model in different scenarios. The experiments use the same evaluation metrics and training settings as those for the backbone network. Comparing combinations of different backbone networks with AFPN and with FPN shows that AFPN brings a certain degree of improvement over FPN.
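The quadratic cost of self-attention mentioned in the abstract, and the saving obtained from non-overlapping windows, can be illustrated with a short counting sketch. This is not the thesis's implementation: the function names, the 224-pixel image, the 4-pixel patch, and the 7-patch window are all assumptions chosen for illustration. The code only counts query-key pairs, which is the term that dominates the attention cost.

```python
def global_attention_pairs(height, width, patch):
    """Query-key pairs when every patch attends to every other patch."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens


def window_attention_pairs(height, width, patch, window):
    """Query-key pairs when attention is limited to non-overlapping
    groups of window x window patches (hypothetical helper)."""
    rows, cols = height // patch, width // patch
    if rows % window or cols % window:
        raise ValueError("token grid must be divisible by the window size")
    num_windows = (rows // window) * (cols // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


# Illustrative sizes only (assumed, not taken from the thesis).
for side in (224, 448):
    g = global_attention_pairs(side, side, patch=4)
    w = window_attention_pairs(side, side, patch=4, window=7)
    print(f"{side}x{side}: global={g:,}  windowed={w:,}")
```

Doubling the image side multiplies the global count by 16 but the windowed count by only 4, since the windowed cost grows linearly with the number of patches. The price is that patches in different windows never interact within one block, which is the gap the cross-window connections described in the abstract are meant to close.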

Keywords/Search Tags:

Object Detection, Vision Transformers, Locally Enhanced, Channel Attention, Feature Pyramid Network

Related items

1	Target Detection Algorithm Based On Feature Pyramid Structure
2	Efficient And Lightweight Feature Pyramid Network For Object Detection
3	Research On Object Detection Based On Improved Feature Pyramid Networks
4	Research On Object Detection Algorithm Based On Feature Pyramid Fusion And Attention Mechanism
5	Research On Object Detection Algorithm Based On Improved FPN Feature Fusion Strategy
6	MFE:Multi-scale Feature Enhancement For Object Detection
7	Video Object Detection Based On Attention Mechanism And Multi-Scale Feature Fusion Convolutional Network
8	Research On Image Classification Algorithm Base On Vision Transformer
9	Improved Algorithm Of Object Detection Based On One Stage Network Model
10	Research On 3D Small Object Detection Method Based On Attentional Feature Enhancement