
Research On Transformer-based Object Detection With Local And Global Interaction

Posted on: 2024-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
Full Text: PDF
GTID: 2568307091465384
Subject: Computer Science and Technology

Abstract/Summary:
Two-dimensional object detection provides rich semantic and spatial information for vision applications such as autonomous driving, and has played an important role in driving progress in computer vision. In recent years, the powerful global modeling capability and parallelism that the Transformer demonstrated in natural language processing have given rise to a series of Vision Transformer networks. However, the Transformer's attention mechanism has computational complexity quadratic in the length of the input sequence, which demands a large amount of computation and parameters for object detection tasks that typically take high-resolution images as input. In addition, when constructing global relationships, some Vision Transformers use an interaction mechanism that connects the local directly to the global, ignoring the hierarchical nature of the information interaction process. Balancing computational complexity against detection performance has therefore become a current research hotspot for Vision Transformers.

This thesis investigates Vision Transformers for the object detection task from the perspectives of both detection performance and computational efficiency. On the one hand, it aims to design an effective feature extraction network that emphasizes the importance of global contextual semantic information for the network's feature representation capability while respecting the hierarchical nature of local-to-global feature interaction. On the other hand, it explores the balance between the computational complexity of the self-attention mechanism and real-time detection, increasing the network's potential for real-world applications. The main contributions of this thesis are as follows:

1. To address the quadratic computational complexity incurred when computing global semantic relations with the multi-head self-attention mechanism, this thesis proposes a window-level local-global interaction algorithm. Its core idea is to shorten the input sequence over which each attention computation runs, and correspondingly increase the number of attention computations, thereby reducing the computational complexity to linear.

2. To address existing local-global interaction methods that connect the local directly to the global and therefore fuse information insufficiently, this thesis proposes a new feature extraction network with three layers of local-to-global interaction, built on the Swin Transformer pyramid hierarchy. The network gradually establishes dependencies between features in three stages: local interaction, a transition interaction between local and global, and global interaction, making information fusion more fine-grained and improving the expressiveness of the network. To limit the computation and parameters that the transition module adds to the whole network, a lightweight algorithm based on channel shifting is proposed. The shift operation consumes no additional parameters or computation, which effectively reduces the overall complexity of the network and improves the model's real-time detection capability.

3. The proposed Transformer-based object detection method with local and global interaction is trained and evaluated on the COCO dataset and the KITTI 2D dataset. Compared with convolutional models, its average precision is 4.5% higher than ResNet; compared with Vision Transformers that have recently achieved state-of-the-art performance, it is 1.1% and 0.5% higher than Swin Transformer and Twins-SVT-S, respectively. The proposed method therefore achieves competitive detection results. Meanwhile, its average precision on the KITTI 2D dataset reaches 91.9%, which is expected to contribute to the advancement of autonomous driving and shows high potential for practical application.
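The window-level idea in contribution 1 can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual implementation: attention is restricted to non-overlapping ws x ws windows, so the cost is O(N * ws^2) in the number of tokens N = H*W rather than O(N^2) for global attention. All function names here are illustrative.

```python
import numpy as np

def window_partition(x, ws):
    # x: (H, W, C) feature map -> (num_windows, ws*ws, C) token groups
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, ws):
    # Self-attention computed independently inside each ws x ws window.
    H, W, C = x.shape
    win = window_partition(x, ws)                          # (nW, ws*ws, C)
    attn = softmax(win @ win.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ win                                       # (nW, ws*ws, C)
    # Reverse the partition back to the (H, W, C) layout.
    nH, nW = H // ws, W // ws
    return (out.reshape(nH, nW, ws, ws, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(H, W, C))
```

For a fixed window size, the number of windows grows linearly with the image area, which is the linear-complexity behavior the abstract describes.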
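The parameter-free channel shift in contribution 2 can be illustrated with the sketch below. It assumes a four-direction spatial shift of equal channel groups, as in shift-based networks; the thesis's exact grouping is not specified here, so the layout is a hypothetical example. The operation exchanges information between neighboring positions using only memory movement, with no learned weights and no multiply-accumulates.

```python
import numpy as np

def channel_shift(x):
    # x: (H, W, C) feature map with C divisible by 4.
    # Each quarter of the channels is shifted one step in one spatial
    # direction; vacated border positions are zero-filled.
    H, W, C = x.shape
    assert C % 4 == 0
    g = C // 4
    out = np.zeros_like(x)
    out[1:, :, 0*g:1*g] = x[:-1, :, 0*g:1*g]   # shift down
    out[:-1, :, 1*g:2*g] = x[1:, :, 1*g:2*g]   # shift up
    out[:, 1:, 2*g:3*g] = x[:, :-1, 2*g:3*g]   # shift right
    out[:, :-1, 3*g:4*g] = x[:, 1:, 3*g:4*g]   # shift left
    return out
```

Because the shift is pure indexing, it adds zero parameters and essentially zero FLOPs, matching the abstract's claim that the transition module stays lightweight.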
Keywords/Search Tags:object detection, transformer, vision transformer, self-attention, local-global interaction, window partition