Sparse Adversarial Examples Attacking on Video Captioning Model

doi:10.11896/jsjkx.221100068

Abstract: Although multi-modal deep learning models such as image captioning models have been shown to be vulnerable to adversarial examples, the adversarial susceptibility of video caption generation remains under-examined. There are two main reasons for this. First, unlike an image captioning system, a video captioning model takes a stream of images rather than a single picture as input, so perturbing every frame of a video would be computationally enormous. Second, unlike a video recognition model, the model outputs not a single word but a more complex semantic description. To address these problems and study the robustness of video captioning models, this paper proposes a sparse adversarial attack method. Specifically, drawing on the idea of saliency maps in image object recognition models, a method is proposed to verify the contribution of different frames to the output of the video captioning model, and an L₂-norm-based optimization objective function suited to video captioning models is designed. Evaluation on the MSR-VTT dataset shows a success rate of 96.4% for the targeted attack and a reduction in queries of more than 45% compared with randomly selecting video frames, which demonstrates the effectiveness of the strategy and reveals the vulnerability of video captioning models.

Key words: Multi-modal, Video captioning, Adversarial example, Saliency map, Keyframe selection

CLC Number: TP391.41

QIU Jiangxing, TANG Xueming, WANG Tianmei, WANG Chen, CUI Yongquan, LUO Ting. Sparse Adversarial Examples Attacking on Video Captioning Model [J]. Computer Science, 2023, 50(12): 330-336.
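
The abstract outlines two ingredients: scoring how much each frame contributes to the model output, and an L₂-norm-based objective that perturbs only a few frames. The following is a minimal sketch of that general idea and not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the finite-difference saliency estimate, the trade-off constant `c`, and in particular `toy_caption_loss`, which stands in for the loss of a real captioning model.

```python
import math
from typing import Callable, List

Video = List[List[float]]  # each frame flattened to a list of pixel values


def toy_caption_loss(video: Video) -> float:
    """Hypothetical stand-in for a captioning loss: a weighted sum of pixels.

    A real attack would query the captioning model here instead.
    """
    weights = [0.1, 2.0, 0.5, 1.5]  # made-up per-frame influence
    return sum(w * sum(frame) for w, frame in zip(weights, video))


def frame_saliency(video: Video, loss_fn: Callable[[Video], float],
                   eps: float = 1e-3) -> List[float]:
    """Score each frame by the L2 norm of a finite-difference loss gradient."""
    base = loss_fn(video)
    scores = []
    for t, frame in enumerate(video):
        sq = 0.0
        for i in range(len(frame)):
            bumped = [list(f) for f in video]  # copy so the input is untouched
            bumped[t][i] += eps
            grad = (loss_fn(bumped) - base) / eps
            sq += grad * grad
        scores.append(math.sqrt(sq))
    return scores


def select_keyframes(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def sparse_objective(video: Video, delta: Video, keyframes: List[int],
                     loss_fn: Callable[[Video], float], c: float = 1.0) -> float:
    """Squared L2 norm of the masked perturbation plus c times the loss."""
    mask = set(keyframes)
    adv = [
        [x + (d if t in mask else 0.0) for x, d in zip(frame, d_frame)]
        for t, (frame, d_frame) in enumerate(zip(video, delta))
    ]
    l2_sq = sum(d * d for t, d_frame in enumerate(delta) if t in mask
                for d in d_frame)
    return l2_sq + c * loss_fn(adv)


if __name__ == "__main__":
    video = [[0.0, 0.0, 0.0] for _ in range(4)]
    keyframes = select_keyframes(frame_saliency(video, toy_caption_loss), k=2)
    delta = [[0.1, 0.1, 0.1] for _ in range(4)]
    print(keyframes, sparse_objective(video, delta, keyframes, toy_caption_loss))
```

In this sketch the perturbation is masked to the selected frames before both the norm and the loss are computed, so frames outside the selection stay untouched. An optimizer would then minimize the objective over `delta`.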
