Survey of Adversarial Sample Attacks for Vision Transformer

doi:10.11896/jsjkx.250600065

Abstract: Vision Transformer (ViT) is a novel architecture that overcomes the local receptive field limitation of traditional convolutional neural networks (CNNs) and has achieved breakthrough progress in computer vision through its global modeling capability. As ViT is increasingly applied in the security field, its structural differences from CNNs have sharply reduced the effectiveness of traditional adversarial attacks, which masks the true vulnerability of ViT and delays the development of defense mechanisms. The model security risks posed by adversarial attacks have made this field a research hotspot. This paper first systematically reviews the core progress of adversarial attack methods for ViT and analyzes how ViT-specific structures, such as image patch partitioning, positional encoding, and attention mechanisms, influence adversarial sample attacks. Second, it classifies the existing key attack methods into white-box attacks, transfer-based black-box attacks, and decision-based black-box attacks, and focuses on the research progress of five types of transfer-based black-box attacks: attacks that optimize for model structure, attacks based on input transformation, attacks based on integrated gradients, attacks on downstream tasks, and attacks based on model alignment. It then examines in depth how the different methods have gradually evolved in terms of perturbation efficiency and cross-model transferability. The core advantages and disadvantages of the various attack methods are systematically summarized, revealing the evolution logic of attack techniques and model defects and providing a reference for innovation in attack and defense technologies. Finally, future research directions are analyzed and discussed.

Key words: Deep learning, Vision Transformer, Adversarial sample attack, Image classification, Self-attention

CLC Number: TP391

GUO Jingchen, YANG Kuiwu, DING Mengdi, WEI Jianghong. Survey of Adversarial Sample Attacks for Vision Transformer [J]. Computer Science, 2026, 53(5): 404-418.
