Computer Science ›› 2024, Vol. 51 ›› Issue (6): 239-246. doi: 10.11896/jsjkx.230300218

• Computer Graphics & Multimedia •

DETR with Multi-granularity Spatial Attention and Spatial Prior Supervision

LIAO Junshuang, TAN Qinhong   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Online: 2024-06-15  Published: 2024-06-05
  • Corresponding author: TAN Qinhong (tanqh@cqupt.edu.cn)
  • About author: LIAO Junshuang, born in 1999, postgraduate (s210131121@stu.cqupt.edu.cn). His main research interests include computer vision and object detection.
    TAN Qinhong, born in 1968, associate professor. Her main research interests include embedded system design and Internet of Things technology.

Abstract: The Transformer has shown remarkable performance in computer vision in recent years, gaining widespread attention for its excellent global modeling capability and performance competitive with convolutional neural networks (CNNs). Detection Transformer (DETR) is the first end-to-end network to adopt the Transformer architecture for object detection, but it suffers from slow training convergence and suboptimal performance because it models all token pairs equivalently across the global scope and its object queries are indistinguishable from one another. To address these issues, we replace the self-attention in the encoder and the cross-attention in the decoder of DETR with a multi-granularity attention mechanism, applying fine-grained attention between nearby tokens and coarse-grained attention between distant tokens to enhance its modeling capability. We also introduce spatial prior constraints into the decoder's cross-attention to supervise network training, which accelerates convergence. Experimental results show that, after incorporating the multi-granularity attention mechanism and spatial prior supervision, the improved model achieves a 16% improvement in recognition accuracy on the PASCAL VOC2012 dataset over the unmodified DETR, while converging twice as fast.
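
The two mechanisms described above (fine-grained attention between nearby tokens combined with coarse-grained attention between distant ones, and a spatial prior supervising the decoder's cross-attention) can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (multi_granularity_attention, spatial_prior_loss), the single-head formulation, the window and pooling sizes, and the Gaussian-prior KL loss are all hypothetical choices; the paper's actual design may differ, for example in multi-head handling and in whether the coarse branch excludes the fine-grained window.

import torch
import torch.nn.functional as F

def multi_granularity_attention(q, feat, window=7, pool=4):
    """Sketch of multi-granularity spatial attention: each query attends at
    full resolution to tokens within `window` (Chebyshev distance) and to a
    `pool`-times-downsampled copy of the map for everything farther away.
    Assumes one query per spatial token (the encoder self-attention case)
    and H, W divisible by `pool`.
    q: (B, H*W, C) queries;  feat: (B, C, H, W) key/value feature map.
    """
    B, C, H, W = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)                      # (B, L, C), L = H*W
    # Coarse branch: a pooled, low-resolution copy of the whole map.
    coarse = F.avg_pool2d(feat, pool).flatten(2).transpose(1, 2)  # (B, M, C)

    # Fine branch: mask out tokens farther than `window` from each query.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys, xs], dim=-1).reshape(-1, 2)            # (L, 2)
    cheb = (pos[:, None, :] - pos[None, :, :]).abs().max(-1).values
    fine_mask = cheb <= window                                    # (L, L)

    scale = C ** -0.5
    fine_logits = (q @ tokens.transpose(1, 2)) * scale            # (B, L, L)
    fine_logits = fine_logits.masked_fill(~fine_mask, float("-inf"))
    coarse_logits = (q @ coarse.transpose(1, 2)) * scale          # (B, L, M)

    # One softmax over the fine + coarse candidates keeps each query's
    # attention normalized across both granularities.
    attn = torch.cat([fine_logits, coarse_logits], dim=-1).softmax(dim=-1)
    values = torch.cat([tokens, coarse], dim=1)                   # (B, L+M, C)
    return attn @ values                                          # (B, L, C)

def spatial_prior_loss(cross_attn, ref_xy, H, W, sigma=0.1):
    """Sketch of spatial prior supervision: pull each decoder query's
    cross-attention map toward a Gaussian prior centred on the query's
    normalized reference point (a hypothetical formulation of the
    'spatial prior constraint' in the abstract).
    cross_attn: (B, Q, H*W) attention weights;  ref_xy: (B, Q, 2) in [0, 1].
    """
    ys = (torch.arange(H, dtype=torch.float32) + 0.5) / H
    xs = (torch.arange(W, dtype=torch.float32) + 0.5) / W
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).reshape(1, 1, -1, 2)     # (1, 1, L, 2)
    d2 = ((grid - ref_xy.unsqueeze(2)) ** 2).sum(-1)              # (B, Q, L)
    prior = torch.exp(-d2 / (2 * sigma ** 2))
    prior = prior / prior.sum(-1, keepdim=True)                   # normalized prior
    # F.kl_div takes log-probabilities as input and computes KL(prior || attn):
    # queries whose attention puts little mass near the reference point are penalized.
    return F.kl_div(cross_attn.clamp_min(1e-8).log(), prior, reduction="batchmean")

if __name__ == "__main__":
    B, C, H, W, Q = 2, 64, 16, 16, 100
    feat = torch.randn(B, C, H, W)
    q = feat.flatten(2).transpose(1, 2)                 # encoder case: queries = tokens
    print(multi_granularity_attention(q, feat).shape)   # torch.Size([2, 256, 64])
    attn = torch.rand(B, Q, H * W).softmax(dim=-1)
    print(spatial_prior_loss(attn, torch.rand(B, Q, 2), H, W))

The joint softmax over fine and coarse candidates is one simple way to combine the two granularities; in training, a supervision term like the one above would be added to the usual DETR set-prediction loss with a weighting coefficient.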

Key words: Multi-granularity spatial attention, Spatial prior supervision, Object detection, Vision Transformer, Encoder-decoder architecture

CLC Number: TP391.4