Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (1): 194-209. doi: 10.11896/jsjkx.240600135
李玉洁1,2, 马子航1, 王艺甫1, 王星河1, 谭本英1,2
LI Yujie1,2, MA Zihang1, WANG Yifu1, WANG Xinghe1, TAN Benying1,2
Abstract: The Vision Transformer (ViT) adapts the Transformer, a model originally designed around an encoder-decoder structure, to computer vision, where it has been applied with notable success. In recent years, ViT-based studies have appeared in rapid succession and achieved remarkable results, and work built on this model has become an important research direction for computer vision tasks; this paper therefore surveys the recent development of ViT. First, it briefly reviews the basic principles of ViT and its transfer from natural language processing to vision, and analyzes the structural characteristics and advantages of the model. Then, according to the improvement characteristics of the various ViT variants, it categorizes the main directions in which ViT backbone networks have been improved and the representative models of each direction, covering locality enhancement, structural improvement, self-supervised learning, and lightweight and efficiency-oriented designs, and compares and analyzes them. Finally, it discusses the remaining shortcomings of ViT and its improved models and looks ahead to future research directions. The survey can serve as a reference for researchers weighing deep-learning methods when conducting research based on ViT backbone networks.
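The basic ViT pipeline that the survey reviews (splitting an image into fixed-size patches, linearly embedding them, prepending a class token, adding learned position embeddings, and feeding the sequence to a Transformer encoder) can be illustrated with a minimal sketch. The snippet below is an illustrative PyTorch sketch, not the implementation of any model discussed in the paper; the class name TinyViT and all hyperparameters (patch size 16, embedding dimension 192, 4 layers, 3 heads) are assumptions chosen only to keep the example small, and a reasonably recent PyTorch is assumed.

```python
# Minimal ViT-style classifier: patch embedding -> [CLS] token -> position
# embedding -> Transformer encoder -> linear head on the [CLS] token.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch embedding implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))              # -> shape (2, 1000)
```

Note that, unlike the encoder-decoder Transformer it derives from, this vision backbone keeps only the encoder stack; the hierarchical, locality-enhanced, and efficiency-oriented variants surveyed in the paper modify exactly these components (patch embedding, attention pattern, and token handling).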