
Research On Transformer-Based End-to-End Object Detection

Posted on: 2024-09-24  Degree: Master  Type: Thesis
Country: China  Candidate: J Zhao  Full Text: PDF
GTID: 2568307067493744  Subject: Signal and Information Processing
Abstract/Summary:
Object detection is a fundamental and important task in computer vision that aims to localize objects and predict their categories. Many well-established detectors based on convolutional neural networks (CNNs) achieve promising results. In recent years, the transformer has attracted considerable attention from academia and industry. Thanks to its ability to model interrelations among global information, transformer-based detectors can take full advantage of context and achieve stronger results. Region-based end-to-end transformer-like detectors, such as Sparse R-CNN, perform well. This thesis studies such detectors and proposes an end-to-end task-specific detector with IoU-enhanced attention, as well as a recursive detector based on box-location positional encoding. The contributions of this thesis are as follows:

(1) Because of the one-to-one interaction strategy between proposal features and proposal boxes, transformer-like detectors rely heavily on self-attention. As a result, proposal features easily interact with irrelevant ones, losing their distinctive identity and harming performance. This thesis proposes to use IoU as a prior to enhance self-attention: the IoU matrix computed among the proposal boxes multiplies the attention matrix, restricting which keys each query compares against, so irrelevant keys are suppressed. In addition, object detection consists of classification and regression, which focus on different regions of an object: the former focuses on the center, while the latter concentrates on the contours. We propose a dynamic channel weighting module that generates two channel masks with lightweight projection heads; the masks are multiplied with the object features to extract suitable features for the two tasks.

(2) Transformer-like detectors usually have cascade stages that progressively refine predictions toward the ground truth, and this cascade structure leads to a large number of parameters. This thesis shares the parameters across the decoder stages, which greatly reduces the model size with only a small drop in performance. Moreover, we reuse the dynamic convolution module to build an in-stage recursive structure and increase the depth of the model. A bounding-box positional encoding further boosts the recursive detector: it makes the decoder aware of the location and shape of each proposal box, so the model adapts better to proposals in different stages. We also utilize centerness to help the kernels and RoI features distinguish spatial information within the proposal box.
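The abstract does not give the exact formulation of the IoU-enhanced attention, so the following is only a minimal NumPy sketch of the idea it describes: the pairwise IoU matrix among proposal boxes gates the softmaxed attention weights, so keys whose boxes barely overlap a query's box are suppressed. The function names and the renormalization step are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix among N boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / union

def iou_enhanced_attention(q, k, v, boxes):
    """Self-attention over proposal features, with the score matrix
    multiplied element-wise by the IoU prior among proposal boxes."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    attn = attn * pairwise_iou(boxes)                     # IoU prior gates the weights
    attn = attn / (attn.sum(axis=-1, keepdims=True) + 1e-6)
    return attn @ v
```

With fully disjoint boxes, the IoU prior reduces each row to self-attention only, which is exactly the "suppress irrelevant keys" behavior the abstract describes.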
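The dynamic channel weighting module is described only at a high level (two channel masks from a lightweight projection head, multiplied with the object features). A toy NumPy sketch under that description, with randomly initialized projection weights standing in for the learned heads:

```python
import numpy as np

def dynamic_channel_weighting(features, w_cls, w_reg):
    """Gate object features into task-specific variants.

    features: (N, C) object features
    w_cls, w_reg: (C, C) projection weights (learned in the real model;
                  placeholders here) producing sigmoid channel masks.
    Returns (classification features, regression features), each (N, C).
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    mask_cls = sigmoid(features @ w_cls)  # channels useful for classification
    mask_reg = sigmoid(features @ w_reg)  # channels useful for regression
    return features * mask_cls, features * mask_reg
```

Because the masks are sigmoid outputs in (0, 1), each branch sees a softly re-weighted copy of the same features rather than a hard channel split.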
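The box-location positional encoding is likewise only named, not specified. A common construction (assumed here, not taken from the thesis) is a sinusoidal encoding of the box center, width, and height, so the decoder sees both location and shape:

```python
import numpy as np

def sine_box_positional_encoding(boxes, dim=64, temperature=10000.0):
    """Sinusoidal positional encoding of proposal boxes.

    boxes: (N, 4) in (x1, y1, x2, y2) format.
    Each of (cx, cy, w, h) is expanded into dim/4 sin/cos channels,
    yielding an (N, dim) encoding to add to the proposal features.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    coords = np.stack([cx, cy, w, h], axis=-1)        # (N, 4)
    n = dim // 8                                      # frequencies per coordinate
    freqs = temperature ** (np.arange(n) / n)         # (n,)
    angles = coords[..., None] / freqs                # (N, 4, n)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(boxes), -1)                # (N, dim)
```

Because the encoding is a pure function of the current boxes, it can be recomputed at every cascade or recursive stage, which matches the abstract's point that it keeps the decoder adaptive to proposals as they are refined.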
Keywords/Search Tags:Deep Learning, Object Detection, Transformer, Attention, Positional Encoding