Review of Transformer in Computer Vision

doi:10.11896/jsjkx.221100076

Computer Science, 2023, Vol. 50, Issue (12): 130-147. doi: 10.11896/jsjkx.221100076

• Computer Graphics & Multimedia •

CHEN Luoxuan^1,2, LIN Chengchuang^3, ZHENG Zhaoliang^1,2, MO Zefeng^1,2, HUANG Xinyi^1,2, ZHAO Gansen^1,2

1 School of Computer Science, South China Normal University, Guangzhou 510663, China
2 Guangzhou Key Lab on Cloud Computing Security and Assessment Technology, Guangzhou 510663, China
3 Guangdong Planning and Designing Institute of Telecommunications Co., Ltd., Guangzhou 510630, China

Received: 2022-11-09 Revised: 2023-02-25 Online: 2023-12-15 Published: 2023-12-07
Corresponding author: ZHAO Gansen (gzhao@m.scnu.edu.cn)
About the authors: CHEN Luoxuan (chenlx@m.scnu.edu.cn), born in 1998, postgraduate, is a member of the China Computer Federation. Her main research interests include computer vision and medical image segmentation.
ZHAO Gansen, born in 1977, Ph.D., professor. His main research interests include medical artificial intelligence, medical imaging, and medical data analysis.
Supported by:
National Natural Science Foundation of China (82271267) and Major Program of the National Social Science Fund of China (19ZDA041).

Abstract

Abstract: Transformer is an attention-based encoder-decoder architecture. Owing to its long-range modeling and parallel computing capabilities, Transformer has achieved major breakthroughs in natural language processing and has gradually been extended to computer vision (CV), where it has become an important research direction. This paper reviews and summarizes the applications and improvements of Transformer in three major CV tasks: image classification, object detection and image segmentation. First, taking image classification as the starting point, it analyzes the key problems of current vision Transformers in terms of data scale, structural characteristics and computational efficiency, and categorizes the corresponding solutions and ideas according to these problems. Second, it comprehensively surveys the research progress of vision Transformers in object detection and image segmentation, organizes the methods by structural characteristics and design motivation, and compares the strengths and weaknesses of representative methods. Finally, the open problems and development trends of Transformer in CV tasks are summarized and discussed.

Key words: Vision Transformer, Computer vision, Image classification, Object detection, Image segmentation

CLC Number: TP391

CHEN Luoxuan, LIN Chengchuang, ZHENG Zhaoliang, MO Zefeng, HUANG Xinyi, ZHAO Gansen. Review of Transformer in Computer Vision[J]. Computer Science, 2023, 50(12): 130-147. https://doi.org/10.11896/jsjkx.221100076
