Computer Science (计算机科学), 2023, Vol. 50, Issue (12): 130-147. doi: 10.11896/jsjkx.221100076
陈洛轩1,2, 林成创3, 郑招良1,2, 莫泽枫1,2, 黄心怡1,2, 赵淦森1,2
CHEN Luoxuan1,2, LIN Chengchuang3, ZHENG Zhaoliang1,2, MO Zefeng1,2, HUANG Xinyi1,2, ZHAO Gansen1,2
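Abstract: Transformer is an attention-based encoder-decoder architecture. Thanks to its long-range modeling and parallel computing capabilities, it has achieved major breakthroughs in natural language processing and has gradually been extended to computer vision, where it has become an important research direction. This paper reviews and summarizes the applications and improvements of Transformer in three major computer vision tasks: image classification, object detection, and image segmentation. First, taking image classification as the starting point, it analyzes in depth the key problems of current vision Transformers in terms of data scale, structural characteristics, and computational efficiency, and classifies the solutions and ideas according to these key problems. Second, it comprehensively surveys the research progress of vision Transformers in object detection and image segmentation, organizes the methods by structural characteristics and design motivation, and compares the strengths and weaknesses of representative methods. Finally, it summarizes and discusses the open problems and development trends of Transformer in computer vision tasks.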
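As a rough illustration of the attention mechanism the abstract refers to, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is not taken from the paper: the function names, the toy token vectors, and the simplification of using identity projections (queries, keys, and values all equal the input tokens) are assumptions made only for this example. It shows why every token can attend to every other token regardless of distance, and why each output row can be computed independently of the others.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention.

    Assumption for illustration: identity projections, so Q = K = V = tokens.
    A real Transformer layer would apply learned linear projections first.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:  # each query row is independent, hence parallelizable
        # Similarity of this query to every key, near or far in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(w_j * v[i] for w_j, v in zip(w, tokens)) for i in range(d)]
        )
    return outputs, weights


# Hypothetical toy input: three tokens (e.g. three image patches), 2 features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens)
```

Each row of `attn` is a probability distribution over all tokens, so the first token draws on the third as readily as on itself. Nothing in the computation depends on how far apart the tokens are.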