Computer Science ›› 2024, Vol. 51 ›› Issue (6): 239-246.doi: 10.11896/jsjkx.230300218

• Computer Graphics & Multimedia •

DETR with Multi-granularity Spatial Attention and Spatial Prior Supervision

LIAO Junshuang, TAN Qinhong   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Online: 2024-06-15  Published: 2024-06-05
  • About author: LIAO Junshuang, born in 1999, postgraduate. His main research interests include computer vision, object detection, etc.
    TAN Qinhong, born in 1968, associate professor. Her main research interests include embedded system design, Internet of Things technology, etc.

Abstract: The Transformer has shown remarkable performance in computer vision in recent years and has gained widespread attention for its strong global modeling capability and performance that rivals convolutional neural networks (CNNs). Detection Transformer (DETR) is the first end-to-end network to adopt the Transformer architecture for object detection, but it converges slowly during training and delivers suboptimal accuracy, because it models all positions over the global scope equivalently and its object query keys are hard to distinguish from one another. To address these issues, we replace the self-attention in DETR's encoder and the cross-attention in its decoder with a multi-granularity attention mechanism that applies fine-grained attention to tokens that are spatially close and coarse-grained attention to tokens that are far apart, thereby enhancing the model's representational capability. We also introduce spatial prior constraints into the decoder's cross-attention to supervise network training, which accelerates convergence. Experimental results show that, after incorporating the multi-granularity attention mechanism and spatial prior supervision, the improved model achieves a 16% improvement in recognition accuracy on the PASCAL VOC2012 dataset over the unmodified DETR, while converging twice as fast.
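The abstract names two mechanisms: a multi-granularity attention that attends finely to nearby tokens and coarsely to distant ones, and a spatial prior that supervises the decoder's cross-attention. The PyTorch sketch below illustrates one plausible reading of each idea; it is not the authors' implementation, and the names and hyper-parameters (window_size, region_size, sigma) are illustrative assumptions.

```python
# A minimal PyTorch sketch of the two ideas named in the abstract -- NOT the
# authors' released code.  window_size, region_size and sigma are assumptions.
import torch
import torch.nn.functional as F


def multi_granularity_attention(q, k, v, h, w, window_size=3, region_size=4):
    """Single-head attention over an h*w token grid (tokens in row-major order).

    Each query attends with fine granularity to keys inside a local
    window_size x window_size neighbourhood and with coarse granularity to
    average-pooled summaries of region_size x region_size regions.
    q, k, v: (h*w, d) tensors.  Returns an (h*w, d) tensor.
    """
    d = q.size(-1)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()      # (h*w, 2)

    # Fine-grained branch: mask out keys outside each query's local window.
    dist = (pos[:, None, :] - pos[None, :, :]).abs().max(-1).values      # Chebyshev distance
    fine_mask = dist <= window_size // 2                                 # (h*w, h*w)
    fine_scores = (q @ k.t()) / d ** 0.5
    fine_scores = fine_scores.masked_fill(~fine_mask, float("-inf"))

    # Coarse-grained branch: attend to pooled summaries of distant regions.
    k_coarse = F.avg_pool2d(k.t().reshape(1, d, h, w), region_size).flatten(2).squeeze(0).t()
    v_coarse = F.avg_pool2d(v.t().reshape(1, d, h, w), region_size).flatten(2).squeeze(0).t()
    coarse_scores = (q @ k_coarse.t()) / d ** 0.5                        # (h*w, n_regions)

    # A single softmax over both granularities, then mix the two value sets.
    attn = torch.softmax(torch.cat([fine_scores, coarse_scores], dim=-1), dim=-1)
    n_fine = fine_scores.size(-1)
    return attn[:, :n_fine] @ v + attn[:, n_fine:] @ v_coarse


def spatial_prior(ref_points, h, w, sigma=2.0):
    """Gaussian-like spatial prior over the h*w grid, one distribution per query.

    ref_points: (num_queries, 2) reference centres in (row, col) grid units.
    The returned (num_queries, h*w) map can be used to supervise (or re-weight)
    decoder cross-attention so that each query focuses near its reference point.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()      # (h*w, 2)
    d2 = ((ref_points[:, None, :] - pos[None, :, :]) ** 2).sum(-1)       # (Q, h*w)
    return torch.softmax(-d2 / (2 * sigma ** 2), dim=-1)
```

In a DETR-style model, a function like multi_granularity_attention would stand in for the dense attention over encoder tokens, and a map like spatial_prior(ref_points, h, w) would provide the target distribution against which decoder cross-attention is supervised; both are hypothetical stand-ins for the mechanisms described above.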

Key words: Multi-granularity spatial attention, Spatial prior supervision, Object detection, Vision Transformer, Encoder-Decoder architecture

CLC Number: TP391.4

References
[1]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.
[2]REDMON J,DIVVALA S,GIRSHICK R,et al.You only look once:Unified,real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:779-788.
[3]LIN T Y,GOYAL P,GIRSHICK R,et al.Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:2980-2988.
[4]REN S,HE K,GIRSHICK R,et al.Faster R-CNN:Towards real-time object detection with region proposal networks[J].arXiv:1506.01497,2015.
[5]DUAN K,BAI S,XIE L,et al.Centernet:Keypoint triplets for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:6569-6578.
[6]LAW H,DENG J.Cornernet:Detecting objects as paired keypoints[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:734-750.
[7]LIN T Y,DOLLÁR P,GIRSHICK R,et al.Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:2117-2125.
[8]LIU S,QI L,QIN H,et al.Path aggregation network for instance segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:8759-8768.
[9]WOO S,PARK J,LEE J Y,et al.Cbam:Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:3-19.
[10]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16×16 words:Transformers for image recognition at scale[C]//International Conference on Learning Representations.2021:1-22.
[11]TOUVRON H,CORD M,DOUZE M,et al.Training data-efficient image Transformers & distillation through attention[C]//International Conference on Machine Learning.PMLR,2021:10347-10357.
[12]WANG W,XIE E,LI X,et al.Pyramid vision Transformer:A versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:568-578.
[13]WU H,XIAO B,CODELLA N,et al.Cvt:Introducing convolutions to vision Transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:22-31.
[14]LI Y,CHEN Y P,WANG T,et al.Tokens-to-token vit:Training vision Transformers from scratch on imagenet[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:558-567.
[15]CARION N,MASSA F,SYNNAEVE G,et al.End-to-end object detection with Transformers[C]//Computer Vision-ECCV 2020:16th European Conference,Glasgow,UK,Part I 16.Springer International Publishing,2020:213-229.
[16]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[17]ZHU X,SU W,LU L,et al.Deformable DETR:Deformable Transformers for end-to-end object detection[C]//International Conference on Learning Representations.2021:1-16.
[18]GAO P,ZHENG M,WANG X,et al.Fast convergence of detr with spatially modulated co-attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:3621-3630.
[19]YANG J,LI C,ZHANG P,et al.Focal Attention for Long-Range Interactions in Vision Transformers[C]//Advances in Neural Information Processing Systems.2021:30008-30022.
[20]LIU Z,HU H,LIN Y,et al.Swin Transformer v2:Scaling up capacity and resolution[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:12009-12019.
[21]ZHANG G,LUO Z,YU Y,et al.Accelerating DETR convergence via semantic-aligned matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:949-958.
[22]LI F,ZHANG H,LIU S,et al.DN-DETR:Accelerate DETR training by introducing query denoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:13619-13628.
Related Articles
[1] LIU Jiasen, HUANG Jun. Center Point Target Detection Algorithm Based on Improved Swin Transformer [J]. Computer Science, 2024, 51(6): 264-271.
[2] LI Yuehao, WANG Dengjiang, JIAN Haifang, WANG Hongchang, CHENG Qinghua. LiDAR-Radar Fusion Object Detection Algorithm Based on BEV Occupancy Prediction [J]. Computer Science, 2024, 51(6): 215-222.
[3] WU Xiaoqin, ZHOU Wenjun, ZUO Chenglin, WANG Yifan, PENG Bo. Salient Object Detection Method Based on Multi-scale Visual Perception Feature Fusion [J]. Computer Science, 2024, 51(5): 143-150.
[4] JIAN Yingjie, YANG Wenxia, FANG Xi, HAN Huan. 3D Object Detection Based on Edge Convolution and Bottleneck Attention Module for Point Cloud [J]. Computer Science, 2024, 51(5): 162-171.
[5] BAI Xuefei, SHEN Wucheng, WANG Wenjian. Salient Object Detection Based on Feature Attention Purification [J]. Computer Science, 2024, 51(5): 125-133.
[6] XU Hao, LI Fengrun, LU Lu. Metal Surface Defect Detection Method Based on Dual-stream YOLOv4 [J]. Computer Science, 2024, 51(4): 209-216.
[7] LIU Zeyu, LIU Jianwei. Video and Image Salient Object Detection Based on Multi-task Learning [J]. Computer Science, 2024, 51(4): 217-228.
[8] HAO Ran, WANG Hongjun, LI Tianrui. Deep Neural Network Model for Transmission Line Defect Detection Based on Dual-branch Sequential Mixed Attention [J]. Computer Science, 2024, 51(3): 135-140.
[9] ZHANG Yang, XIA Ying. Object Detection Method with Multi-scale Feature Fusion for Remote Sensing Images [J]. Computer Science, 2024, 51(3): 165-173.
[10] WANG Weijia, XIONG Wenzhuo, ZHU Shengjie, SONG Ce, SUN He, SONG Yulong. Method of Infrared Small Target Detection Based on Multi-depth Feature Connection [J]. Computer Science, 2024, 51(1): 175-183.
[11] SHI Dianxi, LIU Yangyang, SONG Linna, TAN Jiefu, ZHOU Chenlei, ZHANG Yi. FeaEM:Feature Enhancement-based Method for Weakly Supervised Salient Object Detection via Multiple Pseudo Labels [J]. Computer Science, 2024, 51(1): 233-242.
[12] YANG Yi, SHEN Sheng, DOU Zhiyang, LI Yuan, HAN Zhenjun. Tiny Person Detection for Intelligent Video Surveillance [J]. Computer Science, 2023, 50(9): 75-81.
[13] ZHU Ye, HAO Yingguang, WANG Hongyu. Deep Learning Based Salient Object Detection in Infrared Video [J]. Computer Science, 2023, 50(9): 227-234.
[14] LIU Yubo, GUO Bin, MA Ke, QIU Chen, LIU Sicong. Design of Visual Context-driven Interactive Bot System [J]. Computer Science, 2023, 50(9): 260-268.
[15] WANG Xu, WU Yanxia, ZHANG Xue, HONG Ruize, LI Guangsheng. Survey of Rotating Object Detection Research in Computer Vision [J]. Computer Science, 2023, 50(8): 79-92.