Computer Science ›› 2025, Vol. 52 ›› Issue (1): 194-209.doi: 10.11896/jsjkx.240600135

• Computer Graphics & Multimedia •

Survey of Vision Transformers(ViT)

LI Yujie1,2, MA Zihang1, WANG Yifu1, WANG Xinghe1, TAN Benying1,2   

  1. School of Artificial Intelligence,Guilin University of Electronic Technology,Guilin,Guangxi 541004,China
    2. Key Laboratory of Artificial Intelligence Algorithm Engineering of Guangxi University,Guilin University of Electronic Technology,Guilin,Guangxi 541004,China
  • Received:2024-06-21 Revised:2024-09-19 Online:2025-01-15 Published:2025-01-09
  • About author:LI Yujie,born in 1988,Ph.D,associate professor.Her main research interests include sparse representation,optimization,deep learning,computer vision,etc.
    TAN Benying,born in 1986,Ph.D,associate professor.His main research interests include sparse representation,optimization,machine learning,deep learning,image and video processing,etc.
  • Supported by:
    Guangxi Science and Technology Major Project(AA22068057) and Natural Science Foundation of Guangxi,China(2022GXNSFBA035644,2021GXNSFBA220039).

Abstract: The Vision Transformer(ViT), which adapts the Transformer architecture, originally an encoder-decoder model for sequence tasks, to visual data, has achieved remarkable success in computer vision. Over the past few years, research centered on ViT has grown rapidly and consistently delivered strong results, making work built on this model a pivotal research direction in computer vision. This paper therefore provides a comprehensive survey of recent advances in ViT. It first revisits the fundamental principles of the Transformer and its adaptation into ViT, and analyzes the structural characteristics and advantages of the ViT model. It then categorizes the directions of improvement for ViT backbone networks and their representative models according to the distinguishing features of each variant: enhancements in locality, structural modifications, self-supervised improvements, and lightweight and efficient designs, all of which are examined and compared in detail. Finally, it discusses the remaining shortcomings of current ViT and its enhanced models, and offers a prospective view of future research directions. This analysis serves as a reference for researchers choosing deep learning methodologies for work on ViT backbone networks.
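To make the pipeline summarized above concrete, the following minimal PyTorch sketch shows how ViT-style models split an image into fixed-size patches, prepend a class token, add position embeddings, and process the resulting token sequence with stacked self-attention (Transformer encoder) blocks. It is an illustrative sketch only: the layer sizes (embedding dimension 192, 4 encoder blocks, 3 heads) are assumptions chosen for brevity, and the built-in nn.TransformerEncoder stands in for the encoder blocks rather than reproducing any specific model discussed in the survey.

```python
# Minimal, illustrative ViT-style classifier (not the surveyed paper's code).
# Hyperparameters below are assumptions for demonstration purposes.
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided convolution cuts the image into
        # non-overlapping patches and projects each one to a token vector.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                      # global self-attention over tokens
        return self.head(self.norm(x[:, 0]))     # classify from the [CLS] token

if __name__ == "__main__":
    logits = MiniViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)                          # torch.Size([2, 1000])
```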

Key words: Computer vision, Pattern recognition, Vision Transformer(ViT), Deep learning, Self-attention

CLC Number: TP391

References
[1]LECUN Y,BENGIO Y,HINTON G.Deep learning[J].Nature,2015,521(7553):436-444.
[2]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.USA:Curran Associates Inc.,2017:6000-6010.
[3]RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving language understanding by generative pre-training[EB/OL].https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[4]OpenAI.GPT-4 technical report[R].CA:OpenAI,2023.
[5]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:transformers for image recognition at scale[C]//International Conference on Learning Representations.OpenReview.net,2021.
[6]DENG J,DONG W,SOCHER R,et al.ImageNet:A large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2009:248-255.
[7]LECUN Y,BOTTOU L,BENGIO Y,et al.Gradient-based learning applied to document recognition[J].Proceedings of the IEEE,1998,86(11):2278-2324.
[8]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.Association for Computational Linguistics,2019(1):4171-4186.
[9]TOUVRON H,CORD M,DOUZE M,et al.Training data-efficient image transformers & distillation through attention[C]//Proceedings of the 38th International Conference on Machine Learning.New York:ACM,2021:10347-10357.
[10]LIU Z,LIN Y,CAO Y,et al.Swin Transformer:Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Computer Society,2021:10012-10022.
[11]JIANG L,WANG Z Q,CUI Z Y,et al.Visual Transformer based on a recurrent structure[J].Journal of Jilin University(Engineering and Technology Edition),2024,54(7):2049-2056.
[12]HE K,CHEN X,XIE S,et al.Masked autoencoders are scalable vision learners[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:16000-16009.
[13]ZAREMBA W,SUTSKEVER I,VINYALS O.Recurrent neural network regularization [EB/OL].http://arxiv.org/abs/1409.2329.
[14]LIU J W,SONG Z Y.Overview of recurrent neural networks[J].Control and Decision,2022,37(11):2753-2768.
[15]GEHRING J,AULI M,GRANGIER D,et al.Convolutional sequence to sequence learning[C]//Proceedings of the 34th International Conference on Machine Learning.PMLR,2017:1243-1252.
[16]SUKHBAATAR S,SZLAM A,WESTON J,et al.End-to-end memory networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems.Cambridge:MIT Press,2015:2440-2448.
[17]HE K,ZHANG X,REN S,et al.Deep residual learning forimage recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2016:770-778.
[18]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient estimation of word representations in vector space[C]//1st International Conference on Learning Representations.OpenReview.net,2013.
[19]WANG A,SINGH A,MICHAEL J,et al.GLUE:A multi-task benchmark and analysis platform for natural language understanding[C]//7th International Conference on Learning Representations.OpenReview.net,2019.
[20]RAJPURKAR P,ZHANG J,LOPYREV K,et al.SQuAD:100,000+ questions for machine comprehension of text[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.Association for Computational Linguistics,2016:2383-2392.
[21]SOCHER R,PERELYGIN A,WU J,et al.Recursive deep mo-dels for semantic compositionality over a sentiment treebank[C]//Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.Association for Computational Linguistics,2013:1631-1642.
[22]LIU R H,YE X,YUE Z Y.Review of pre-trained models for natural language processing tasks[J].Journal of Computer Applications,2021,41(5):1236-1246.
[23]PARMAR N,VASWANI A,USZKOREIT J,et al.Image transformer[C]//Proceedings of the 35th International Conference on Machine Learning.PMLR,2018:4055-4064.
[24]CHEN M,RADFORD A,CHILD R,et al.Generative pretraining from pixels[C]//Proceedings of the 37th International Conference on Machine Learning.Cambridge:MIT Press,2020:1691-1703.
[25]RADFORD A,WU J,CHILD R,et al.Language models are unsupervised multitask learners[EB/OL].https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[26]CARION N,MASSA F,SYNNAEVE G,et al.End-to-end object detection with transformers[C]//Computer Vision-ECCV 2020-16th European Conference.Switzerland:Springer International Publishing,2020:213-229.
[27]RAMACHANDRAN P,PARMAR N,VASWANI A,et al.Stand-alone self-attention in vision models[C]//Annual Conference on Neural Information Processing Systems 2019.Cambridge:MIT Press,2019:68-80.
[28]CORDONNIER J B,LOUKAS A,JAGGI M.On the relation-ship between self-attention and convolutional layers[C]//8th International Conference on Learning Representations.OpenReview.net,2020.
[29]WU B,XU C,DAI X,et al.Visual Transformers:Token-based image representation and processing for computer vision[EB/OL].http://arxiv.org/abs/2006.03677.
[30]HENDRYCKS D,GIMPEL K.Gaussian error linear units(GELUs)[EB/OL].http://arxiv.org/abs/1606.08415.
[31]LARSSON G,MAIRE M,SHAKHNAROVICH G.FractalNet:Ultra-Deep Neural Networks without Residuals[C]//5th International Conference on Learning Representations.OpenReview.net,2017.
[32]CHU X,TIAN Z,ZHANG B,et al.Conditional Positional Encodings for Vision Transformers[C]//The Eleventh International Conference on Learning Representations.OpenReview.net,2023.
[33]WU K,PENG H,CHEN M,et al.Rethinking and Improving Relative Position Encoding for Vision Transformer[C]//2021 IEEE/CVF International Conference on Computer Vision.IEEE,2021:10013-10021.
[34]RAGHU M,UNTERTHINER T,KORNBLITH S,et al.Do vision transformers see like convolutional neural networks?[C]//Advances in Neural Information Processing Systems 34:Annual Conference on Neural Information Processing Systems 2021.Cambridge:MIT Press,2021:12116-12128.
[35]XU K.Study of convolutional neural network applied on image recognition[D].Zhejiang:Zhejiang University,2012.
[36]ZHENG Y P,LI G Y,LI Y.Survey of application of deep learning in image recognition[J].Computer Engineering and Applications,2019,55(12):20-36.
[37]LIN T,MAIRE M,BELONGIE S,et al.Microsoft COCO:Common Objects in Context[C]//Computer Vision-ECCV 2014-13th European Conference.Switzerland:Springer,2014:740-755.
[38]ZHOU B,ZHAO H,PUIG X,et al.Semantic Understanding of Scenes Through the ADE20K Dataset[J].International Journal of Computer Vision,2019,127(3):302-321.
[39]LIN T,GOYAL P,GIRSHICK R,et al.Focal Loss for Dense Object Detection[C]//Proceedings of IEEE International Conference on Computer Vision.Piscataway:IEEE Computer Society,2017:2999-3007.
[40]XIAO T,LIU Y,ZHOU B,et al.Unified Perceptual Parsing for Scene Understanding[C]//Proceedings of Computer Vision-ECCV 2018-15th European Conference.Switzerland:Springer,2018:432-448.
[41]XIA Z,PAN X,SONG S,et al.Vision transformer with deformable attention[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:4784-4793.
[42]ZHU L,WANG X,KE Z,et al.BiFormer:Vision transformer with bi-level routing attention[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2023:10323-10333.
[43]DING M,XIAO B,CODELLA N,et al.DaViT:Dual Attention Vision Transformers[C]//Computer Vision-ECCV 2022-17th European Conference.Switzerland:Springer,2022:74-92.
[44]YAO T,LI Y,PAN Y,et al.HIRI-ViT:Scaling Vision Transformer with High Resolution Inputs[J/OL].https://ieeexplore.ieee.org/document/10475592.
[45]YUAN K,GUO S,LIU Z,et al.Incorporating convolution designs into visual transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Computer Society,2021:579-588.
[46]XIAO T,SINGH M,MINTUN E,et al.Early convolutions help transformers see better[C]//Advances in Neural Information Processing Systems 34:Annual Conference on Neural Information Processing Systems 2021.Cambridge:MIT Press,2021:30392-30400.
[47]SHI J R,WANG D,SHANG F H,et al.Research advances on stochastic gradient descent algorithm[J].Acta Automatica Sinica,2021,47(9):2103-2119.
[48]CHU Z,CHEN J,CHEN C,et al.DualToken-ViT:Position-aware Efficient Vision Transformer with Dual Token Fusion[EB/OL].http://arxiv.org/abs/2309.12424.
[49]HAN K,XIAO A,WU E,et al.Transformer in transformer[C]//Advances in Neural Information Processing Systems 34:Annual Conference on Neural Information Processing Systems 2021.Cambridge:MIT Press,2021:15908-15919.
[50]YUAN L,CHEN Y,WANG T,et al.Tokens-to-Token ViT:Training vision transformers from scratch on ImageNet[C]//2021 International Conference on Computer Vision.Piscataway:IEEE Computer Society,2021:538-547.
[51]LI K,WANG Y,GAO P,et al.UniFormer:Unifying convolution and self-attention for visual recognition [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2023,45(10):12581-12600.
[52]FANG J,XIE L,WANG X,et al.MSG-Transformer:Exchanging local spatial information by manipulating messenger tokens[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:12063-12072.
[53]DONG X,BAO J,CHEN D,et al.CSWin Transformer:A general vision transformer backbone with cross-shaped windows[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:12124-12134.
[54]DAI J,QI H,XIONG Y,et al.Deformable convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision.Piscataway:IEEE Computer Society,2017:764-773.
[55]HATAMIZADEH A,YIN H,HEINRICH G,et al.Global context vision transformers[C]//International Conference on Machine Learning.New York:ACM,2023:12633-12646.
[56]HINTON G,VINYALS O,DEAN J.Distilling the Knowledge in a Neural Network [EB/OL].http://arxiv.org/abs/1503.02531.
[57]ABNAR S,DEHGHANI M,ZUIDEMA W.Transferring Inductive Biases through Knowledge Distillation[EB/OL].http://arxiv.org/abs/2006.00555.
[58]WANG W,XIE E,LI X,et al.Pyramid vision transformer:a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Computer Society,2021:568-578.
[59]CHU X,TIAN Z,WANG Y,et al.Twins:Revisiting the design of spatial attention in vision transformers[C]//Advances in Neural Information Processing Systems 34:Annual Conference on Neural Information Processing Systems 2021.Cambridge:MIT Press,2021:9355-9366.
[60]CHEN R,PANDA R,FAN Q.RegionViT:Regional-to-local attention for vision transformers[C]//The Tenth International Conference on Learning Representations.OpenReview.net,2022.
[61]XU Y,ZHANG Q,ZHANG J,et al.ViTAE:Vision Transformer Advanced by Exploring Intrinsic Inductive Bias[C]//Advances in Neural Information Processing Systems 34:Annual Conference on Neural Information Processing Systems 2021.Cambridge:MIT Press,2021:28522-28535.
[62]ZHANG Q,XU Y,ZHANG J,et al.ViTAEv2:Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond[J].International Journal of Computer Vision,2023,131(5):1141-1162.
[63]ZHOU D,KANG B,JIN X,et al.DeepViT:Towards deeper vision transformer[EB/OL].http://arxiv.org/abs/2103.11886.
[64]TOUVRON H,CORD M,SABLAYROLLES A,et al.Going deeper with image transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Computer Society,2021:32-42.
[65]CHEN C F,FAN Q,PANDA R.CrossViT:Cross-attention multi-scale vision transformer for image classification[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Computer Society,2021:357-366.
[66]YAO T,LI Y,PAN Y,et al.Dual Vision Transformer[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2023,45(9):10870-10882.
[67]TIAN Y L,WANG Y T,WANG J G,et al.Key problems and progress of vision transformers:the state of the art and prospects[J].Acta Automatica Sinica,2022,48(4):957-979.
[68]BAO H,DONG L,PIAO S,et al.BEiT:BERT pre-training of image transformers[C]//The Tenth International Conference on Learning Representations.OpenReview.net,2022.
[69]XIE Z,ZHANG Z,CAO Y,et al.SimMIM:a simple framework for masked image modeling[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:9643-9653.
[70]LIU Z,HU H,LIN Y,et al.Swin Transformer V2:Scaling up capacity and resolution[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:12009-12019.
[71]MEHTA S,RASTEGARI M.MobileViT:Light-weight,general-purpose,and mobile-friendly vision transformer[C]//The Tenth International Conference on Learning Representations.OpenReview.net,2022.
[72]SANDLER M,HOWARD A,ZHU M,et al.MobileNetV2:Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2018:4510-4520.
[73]ZHANG J,PENG H,WU K,et al.MiniViT:Compressing vision transformers with weight multiplexing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:12145-12154.
[74]WU K,ZHANG J,PENG H,et al.TinyViT:Fast pretraining distillation for small vision transformers[C]//Computer Vision-ECCV 2022-17th European Conference.Switzerland:Springer,2022:68-85.
[75]LIU Z,LI J,SHEN Z,et al.Learning efficient convolutional networks through network slimming[C]//Proceedings of the IEEE International Conference on Computer Vision.Piscataway:IEEE Computer Society,2017:2736-2744.
[76]TANG Y,HAN K,WANG Y,et al.Patch slimming for efficient vision transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2022:12165-12174.
[77]YIN H,VAHDAT A,ALVAREZ J,et al.AdaViT:Adaptive tokens for efficient vision transformer[EB/OL].http://arxiv.org/abs/2112.07658.
[78]XU Y,ZHANG Z,ZHANG M,et al.Evo-ViT:Slow-fast token evolution for dynamic vision transformer[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2022:2964-2972.
[79]BIAN Z,WANG Z,HAN W,et al.Multi-scale and token mergence:make your ViT more efficient[EB/OL].http://arxiv.org/abs/2306.04897.
[80]BOLYA D,FU C Y,DAI X,et al.Token Merging:Your ViT But Faster[C]//The Eleventh International Conference on Learning Representations.OpenReview.net,2023.
[81]GRAINGER R,PANIAGUA T,SONG X,et al.PaCa-ViT:Learning patch-to-cluster attention in vision transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2023:18568-18578.
[82]LIU X,PENG H,ZHENG N,et al.EfficientViT:Memory efficient vision transformer with cascaded group attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2023:14420-14430.
[83]HAN D,PAN X,HAN Y,et al.FLatten Transformer:Vision Transformer using Focused Linear Attention[C]//IEEE/CVF International Conference on Computer Vision.IEEE,2023:5938-5948.
[84]REN S,WEI F,ZHANG Z,et al.TinyMIM:An empirical study of distilling MIM pre-trained models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Computer Society,2023:3687-3697.
[85]REDMON J,DIVVALA S K,GIRSHICK R B,et al.You Only Look Once:Unified,Real-Time Object Detection[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2016:779-788.
[86]ZHU X,SU W,LU L,et al.Deformable DETR:Deformable Transformers for End-to-End Object Detection[C]//9th International Conference on Learning Representations.OpenReview.net,2021.
[87]ZONG Z,SONG G,LIU Y.DETRs with Collaborative Hybrid Assignments Training[C]//IEEE/CVF International Conference on Computer Vision.IEEE,2023:6725-6735.
[88]ZHAO Y,LV W,XU S,et al.DETRs Beat YOLOs on Real-time Object Detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2024.
[89]STRUDEL R,PINEL R G,LAPTEV I,et al.Segmenter:Transformer for Semantic Segmentation[C]//2021 IEEE/CVF International Conference on Computer Vision.IEEE,2021:7242-7252.
[90]GU J,KWON H,WANG D,et al.Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2022:12084-12093.
[91]JAIN J,SINGH A,ORLOV N,et al.SeMask:Semantically Masked Transformers for Semantic Segmentation[C]//IEEE/CVF International Conference on Computer Vision,ICCV 2023-Workshops.IEEE,2023:752-761.
[92]XIE E,WANG W,YU Z,et al.SegFormer:Simple and Efficient Design for Semantic Segmentation with Transformers[C]//Advances in Neural Information Processing Systems 34:Annual Conference on Neural Information Processing Systems 2021,NeurIPS 2021.Cambridge:MIT Press,2021:12077-12090.
[93]QIN Z,LIU J,ZHANG X,et al.Pyramid Fusion Transformer for Semantic Segmentation [J/OL].https://ieeexplore.ieee.org/abstract/document/10540365/.
[94]ARNAB A,DEHGHANI M,HEIGOLD G,et al.ViViT:A Video Vision Transformer[C]//2021 IEEE/CVF International Conference on Computer Vision.IEEE,2021:6816-6826.
[95]LI Y,WU C Y,FAN H,et al.MViTv2:Improved Multiscale Vision Transformers for Classification and Detection[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2022:4794-4804.
[96]FAN H,XIONG B,MANGALAM K,et al.Multiscale Vision Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision.IEEE,2021:6804-6815.
[97]KAY W,CARREIRA J,SIMONYAN K,et al.The Kinetics Human Action Video Dataset[EB/OL].http://arxiv.org/abs/1705.06950.
[98]LIU Z,NING J,CAO Y,et al.Video Swin Transformer[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2022:3192-3201.
[99]MA Y,WANG R.Relative-position embedding based spatially and temporally decoupled Transformer for action recognition[J].Pattern Recognition,2024,145:109905.
[100]SUN W,MA Y,WANG R.k-NN attention-based video vision transformer for action recognition[J].Neurocomputing,2024,574:127256.
[101]SOOMRO K,ZAMIR A R,SHAH M.UCF101:A Dataset of 101 Human Actions Classes From Videos in The Wild[EB/OL].http://arxiv.org/abs/1212.0402.
[102]KUEHNE H,JHUANG H,GARROTE E,et al.HMDB:A large video database for human motion recognition[C]//IEEE International Conference on Computer Vision.Piscataway:IEEE Computer Society,2011:2556-2563.
[103]ZHAO K L,JIN X L,WANG Y Z.Survey on few-shot learning[J].Journal of Software,2021,32(2):349-369.
[104]STEINER A,KOLESNIKOV A,ZHAI X,et al.How to train your ViT? Data,Augmentation,and Regularization in Vision Transformers[J/OL].https://openreview.net/pdf?id=4nPswr1KcP.
[105]HO J,JAIN A,ABBEEL P.Denoising Diffusion Probabilistic Models[C]//Advances in Neural Information Processing Systems 33:Annual Conference on Neural Information Processing Systems 2020.Cambridge:MIT Press,2020:6840-6851.
[106]GANI H,NASEER M,YAQUB M.How to train vision transformer on small-scale datasets?[C]//33rd British Machine Vision Conference 2022.BMVA Press,2022.
[107]SANTORO A,BARTUNOV S,BOTVINICK M,et al.Meta-Learning with Memory-Augmented Neural Networks[C]//Proceedings of the 33rd International Conference on Machine Learning.New York:ACM,2016:1842-1850.
[108]SUN C,SHRIVASTAVA A,SINGH S,et al.Revisiting unreasonable effectiveness of data in deep learning era[C]//Proceedings of the IEEE International Conference on Computer Vision.Piscataway:IEEE Computer Society,2017:843-852.