Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (1): 194-209. doi: 10.11896/jsjkx.240600135
李玉洁1,2, 马子航1, 王艺甫1, 王星河1, 谭本英1,2
LI Yujie1,2, MA Zihang1, WANG Yifu1, WANG Xinghe1, TAN Benying1,2
Abstract: The Vision Transformer (ViT) adapts the Transformer, a model originally designed around an encoder-decoder structure, to computer vision, where it has been applied with notable success. In recent years, ViT-based studies have appeared in rapid succession and achieved remarkable results, and work built on this model has become an important research direction for computer vision tasks; this paper therefore surveys the recent development of ViT. First, it briefly reviews the basic principles of ViT and its transfer from natural language processing to vision, and analyzes the structural characteristics and advantages of the model. Then, according to the improvement characteristics of the various ViT variants, it categorizes the main directions in which ViT backbone networks have been improved and the representative models of each direction, covering locality enhancement, structural improvement, self-supervised learning, and lightweight and efficiency-oriented designs, and compares and analyzes them. Finally, it discusses the remaining shortcomings of ViT and its improved models and looks ahead to future research directions. The survey can serve as a reference for researchers weighing deep-learning methods when conducting research based on ViT backbone networks.
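The basic ViT pipeline that the survey reviews (splitting an image into fixed-size patches, linearly embedding them, prepending a class token, adding learned position embeddings, and feeding the sequence to a Transformer encoder) can be illustrated with a minimal sketch. The snippet below is an illustrative PyTorch sketch, not the implementation of any model discussed in the paper; the class name TinyViT and all hyperparameters (patch size 16, embedding dimension 192, 4 layers, 3 heads) are assumptions chosen only to keep the example small, and a reasonably recent PyTorch is assumed.

```python
# Minimal ViT-style classifier: patch embedding -> [CLS] token -> position
# embedding -> Transformer encoder -> linear head on the [CLS] token.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch embedding implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))              # -> shape (2, 1000)
```

Note that, unlike the encoder-decoder Transformer it derives from, this vision backbone keeps only the encoder stack; the hierarchical, locality-enhanced, and efficiency-oriented variants surveyed in the paper modify exactly these components (patch embedding, attention pattern, and token handling).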