Computer Science ›› 2023, Vol. 50 ›› Issue (12): 330-336. doi: 10.11896/jsjkx.221100068
邱江兴, 汤学明, 王天美, 王成, 崔永泉, 骆婷
QIU Jiangxing, TANG Xueming, WANG Tianmei, WANG Chen, CUI Yongquan, LUO Ting
Abstract: In multimodal deep learning, many studies have shown that image captioning models are vulnerable to adversarial examples, yet the robustness of video captioning models has received little attention. There are two main reasons. First, unlike an image captioning model, a video captioning model takes a stream of frames rather than a single image as input, so perturbing every frame of a video incurs a heavy computational cost. Second, unlike a video recognition model, a video captioning model outputs not a single word but a more complex semantic description. To address these problems and to study the robustness of video captioning models, this paper proposes a sparse adversarial example attack against video captioning models. First, drawing on saliency analysis from image recognition, a method is proposed for estimating how much each frame of a video contributes to the model's output; on this basis, key frames are selected and perturbed. Second, an optimization objective based on the L2 norm is designed for the video captioning model. Experimental results on the MSR-VTT dataset show that the proposed method achieves a 96.4% success rate on targeted attacks and reduces the number of queries by more than 45% compared with randomly selecting frames. These results confirm the effectiveness of the proposed method and reveal the vulnerability of video captioning models.
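The two components of the attack described above (scoring frames by their contribution to the output, then adding an L2 penalty on the perturbation to the attack loss) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `grad_fn`, `frame_saliency`, `select_key_frames`, and `attack_objective` are hypothetical names, and the saliency score here is simply the per-frame gradient magnitude of the loss, in the spirit of saliency maps for image classifiers.

```python
import numpy as np

def frame_saliency(grad_fn, video):
    """Score each frame's contribution to the model output.

    grad_fn(video) is assumed to return the gradient of the caption loss
    w.r.t. the input video, with the same shape (T, H, W, C) as the video.
    The score of a frame is the summed absolute gradient over its pixels.
    """
    grads = grad_fn(video)
    return np.abs(grads).reshape(video.shape[0], -1).sum(axis=1)

def select_key_frames(scores, k):
    """Indices of the k frames with the highest saliency scores, in order."""
    return np.sort(np.argsort(scores)[-k:])

def attack_objective(caption_loss, delta, c=0.05):
    """Targeted caption loss plus an L2 penalty on the sparse perturbation.

    delta holds only the perturbations applied to the selected key frames;
    c (hypothetical value) trades off attack success against visibility.
    """
    return caption_loss + c * np.sum(delta ** 2)

# Toy usage: a synthetic gradient that grows across frames, so the
# later frames are (by construction) the most salient ones.
video = np.zeros((4, 2, 2, 3))
toy_grad = lambda v: np.arange(v.size, dtype=float).reshape(v.shape)
scores = frame_saliency(toy_grad, video)
key = select_key_frames(scores, 2)  # -> array([2, 3])
```

In a real attack the objective would be minimized over `delta` with a gradient-based optimizer (the paper cites Adam), querying the captioning model only on the selected key frames, which is where the reported reduction in query count comes from.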