Computer Science ›› 2023, Vol. 50 ›› Issue (12): 330-336. doi: 10.11896/jsjkx.221100068

• Artificial Intelligence •

  • Corresponding author: TANG Xueming (xmtang@hust.edu.cn)
  • About author: (jiangxingqiu@hust.edu.cn)

Sparse Adversarial Examples Attacking on Video Captioning Model

QIU Jiangxing, TANG Xueming, WANG Tianmei, WANG Chen, CUI Yongquan, LUO Ting   

  1. Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
  • Received: 2022-11-09 Revised: 2023-03-26 Online: 2023-12-15 Published: 2023-12-07
  • About author: QIU Jiangxing, born in 1998, postgraduate. His main research interests include adversarial examples and cyber security.
    TANG Xueming, born in 1974, Ph.D, associate professor. His main research interests include number theory, cryptography and cyber security.


Abstract: Although multi-modal deep learning models such as image captioning models have been shown to be vulnerable to adversarial examples, the adversarial robustness of video captioning models remains under-examined. There are two main reasons for this. On the one hand, in contrast to an image captioning system, the input of a video captioning model is a stream of images rather than a single picture, so perturbing every frame of a video would be computationally expensive. On the other hand, compared with a video recognition model, the output of a video captioning model is not a single word but a more complex semantic description. To address these problems and study the robustness of video captioning models, this paper proposes a sparse adversarial attack method. First, drawing on the idea of saliency maps in image recognition, a method is proposed to estimate the contribution of individual frames to the output of the video captioning model, and the most influential key frames are then selected for perturbation. Second, an L2-norm-based optimization objective function suited to video captioning models is designed. With a success rate of 96.4% for targeted attacks and a reduction of more than 45% in queries compared to randomly selecting video frames, evaluation on the MSR-VTT dataset demonstrates the effectiveness of the proposed method and reveals the vulnerability of video captioning models.
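The two-step method the abstract describes (score each frame's contribution to the caption, then perturb only the top-k key frames under an L2-regularized objective) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: a toy surrogate loss stands in for a real captioning model, finite differences stand in for gradient-based saliency, and all names (`caption_loss`, `frame_saliency`, `sparse_attack`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FRAMES, FEAT_DIM = 8, 16

# Toy surrogate for a targeted captioning attack: the "loss" is the distance
# of per-frame features to a fixed vector representing the target caption.
target = rng.normal(size=FEAT_DIM)

def caption_loss(video):
    """Lower loss = model output closer to the attacker's target caption."""
    return float(np.mean((video - target) ** 2))

def frame_saliency(video, eps=1e-3):
    """Finite-difference estimate of each frame's contribution to the loss,
    a gradient-free stand-in for the saliency analysis in the paper."""
    base = caption_loss(video)
    scores = np.empty(len(video))
    for i in range(len(video)):
        bumped = video.copy()
        bumped[i] += eps
        scores[i] = abs(caption_loss(bumped) - base)
    return scores

def sparse_attack(video, k=3, steps=200, lr=1.0, lam=1e-2):
    """Perturb only the k most salient frames, minimizing
    caption_loss(video + delta) + lam * mean(delta**2)."""
    keyframes = np.argsort(frame_saliency(video))[-k:]
    delta = np.zeros_like(video)
    for _ in range(steps):
        # Analytic gradient of the toy objective w.r.t. delta; a real attack
        # would obtain this via backpropagation or query-based estimation.
        grad = 2 * (video + delta - target) / video.size
        grad += 2 * lam * delta / video.size
        delta[keyframes] -= lr * grad[keyframes]
    return video + delta, keyframes

video = rng.normal(size=(NUM_FRAMES, FEAT_DIM))
adv, keyframes = sparse_attack(video)
```

Restricting the perturbation to the selected key frames is what makes the attack sparse: the query and computation budget scales with k rather than with the number of frames, which is the efficiency gain the abstract reports against random frame selection.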

Key words: Multi-modal model, Video captioning model, Adversarial example attack, Saliency map, Keyframe selection

CLC number: TP391.41