Computer Science ›› 2024, Vol. 51 ›› Issue (6): 239-246. doi: 10.11896/jsjkx.230300218
LIAO Junshuang, TAN Qinhong
Abstract: In recent years, Transformers have performed remarkably in the vision domain, attracting wide attention for their strong global modeling capability and performance comparable to CNNs. DETR (Detection Transformer), built on this architecture, was the first end-to-end network to apply a Transformer to object detection, but its uniform modeling over the entire global range and the indistinguishability of its object queries lead to slow training convergence and suboptimal performance. To address these problems, a multi-granularity attention mechanism replaces the self-attention in DETR's encoder and the cross-attention in its decoder: fine granularity is used between nearby tokens and coarse granularity between distant tokens, strengthening the modeling capability. In addition, a spatial prior constraint is introduced into the decoder's cross-attention to supervise network training and accelerate convergence. Experimental results show that, after introducing multi-granularity attention and spatial prior supervision, the improved model raises detection accuracy on the PASCAL VOC2012 dataset by 16% over the unmodified DETR and converges twice as fast.
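The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the window radius `r`, pooling size `s`, and the Gaussian form of the spatial prior are illustrative assumptions. For one query token, nearby tokens are kept at fine granularity (used individually as keys), while the rest of the feature map is average-pooled into coarse tokens; a spatial prior is modeled as an additive log-space bias on the attention logits.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_granularity_keys(feat, qy, qx, r=1, s=2):
    """For the query at grid position (qy, qx), gather fine-grained keys
    (every token within Chebyshev distance r) plus coarse-grained keys
    (each s*s window of the whole map average-pooled into one token)."""
    H, W, d = feat.shape
    fine = [feat[y, x]
            for y in range(max(0, qy - r), min(H, qy + r + 1))
            for x in range(max(0, qx - r), min(W, qx + r + 1))]
    coarse = [feat[y0:y0 + s, x0:x0 + s].reshape(-1, d).mean(axis=0)
              for y0 in range(0, H, s)
              for x0 in range(0, W, s)]
    return np.stack(fine + coarse)          # (n_fine + n_coarse, d)

def gaussian_prior(H, W, cy, cx, sigma=1.5):
    """Log-space spatial bias centered at (cy, cx): tokens far from the
    center are penalized before the softmax (illustrative Gaussian form)."""
    ys, xs = np.mgrid[0:H, 0:W]
    return -((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2)

def attend(q, keys, prior=None):
    """Scaled dot-product attention of one query over a key set,
    optionally biased by a spatial prior on the logits."""
    logits = keys @ q / np.sqrt(q.shape[0])
    if prior is not None:
        logits = logits + prior
    w = softmax(logits)
    return w @ keys, w
```

On a 4x4 grid with `r=1` and `s=2`, a central query sees 9 fine keys and 4 coarse keys instead of all 16 tokens, which is how the multi-granularity scheme trades a small local cost for a compressed global view; the Gaussian prior plays the role of the spatial constraint on decoder cross-attention.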