Computer Science ›› 2025, Vol. 52 ›› Issue (7): 315-341.doi: 10.11896/jsjkx.241100141

• Information Security •

Survey of Security Research on Multimodal Large Language Models

CHEN Jinyin1,2, XI Changkun1, ZHENG Haibin1,2,3, GAO Ming1, ZHANG Tianxin1   

  1 College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
    2 College of Computer Science and Technology & College of Software, Zhejiang University of Technology, Hangzhou 310023, China
    3 Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University, Chengdu 610000, China
  • Received: 2024-01-20  Revised: 2025-01-02  Online: 2025-07-15  Published: 2025-07-17
  • About author: CHEN Jinyin, born in 1982, Ph.D, professor, is a member of CCF (No.14348M). Her main research interests include artificial intelligence security, graph data mining, and evolutionary computing.
    ZHENG Haibin, born in 1995, Ph.D, lecturer, is a member of CCF (No.72193M). His main research interests include deep learning and artificial intelligence security.
  • Supported by:
    National Natural Science Foundation of China (62406286), Zhejiang Provincial Natural Science Foundation (LDQ23F020001), Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University (SCUSAKFKT202402Z), and Beijing Life Science Academy (BLSA) (2024200CD0210).

Abstract: With the rapid development of large language models, multimodal large language models have garnered attention for their outstanding performance across various modalities, such as language and images. These models have not only become valuable assistants in daily tasks but are also gradually penetrating major application areas, such as autonomous driving and medical diagnosis. Compared to traditional large language models, multimodal large language models offer enormous potential but also face greater challenges, owing to their closer alignment with real-world applications involving multiple data sources and the complexity of multimodal processing. However, research on the vulnerabilities of multimodal large language models remains relatively limited, and these models face numerous security challenges in practical applications. This paper provides a comprehensive survey of the security of multimodal large language models, particularly large vision-language models. Firstly, the basic architecture and development history of multimodal large language models are summarized. Then, the causes of security risks throughout the full lifecycle of these models are discussed, and the correlations between model architecture and security risks are analyzed. Next, this paper systematically summarizes current efforts in evaluating the security of multimodal large language models with respect to image and text security, covering model hallucinations, privacy, bias, and robustness. Attacks on multimodal large language models are divided into jailbreak attacks, adversarial attacks, backdoor attacks, and poisoning attacks. Furthermore, the paper gives a comprehensive overview of trustworthiness enhancement methods addressing threats such as hallucinations, privacy leaks, and bias in multimodal large language models, as well as defense mechanisms against malicious attacks on these models. Finally, the main opportunities and challenges in the security research of multimodal large language models are discussed, and guidance and recommendations are provided for researchers working on the complex applications and research areas of multimodal large language models.

Key words: Multimodal large language models, Security, Hallucinations, Adversarial, Jailbreak, Defense

CLC Number: TP391
[1]JI J,QIU T,CHEN B,et al.Ai alignment:a comprehensive survey[J].arXiv:2310.19852,2023.
[2]YUAN J,SUN S,OMEIZA D,et al.Rag-driver:generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model[J].arXiv:2402.10828,2024.
[3]ZHANG Z,ZHANG A,LI M,et al.Multimodal chain-of-thought reasoning in language models[J].arXiv:2302.00923,2023.
[4]SREERAM S,WANG T H,MAALOUF A,et al.Probing multimodal llms as world models for driving[J].arXiv:2405.05956,2024.
[5]ZHANG X,WU C,ZHAO Z,et al.PMC-VQA: visual instruction tuning for medical visual question answering[J].arXiv:2305.10415,2024.
[6]RAHMAN Md A,ALQAHTANI L,ALBOOQ A,et al.A survey on security and privacy of large multimodal deep learning models:teaching and learning perspective[C]//2024 21st Learning and Technology Conference (L&T).2024:13-18.
[7]AVSEC Ž,AGARWAL V,VISENTIN D,et al.Effective gene expression prediction from sequence by integrating long-range interactions[J].Nature Methods,2021,18(10):1196-1203.
[8]CABELLO L,BUGLIARELLO E,BRANDL S,et al.Evaluating bias and fairness in gender-neutral pretrained vision-and-language models[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:8465-8483.
[9]SAMSON L,BARAZANI N,GHEBREAB S,et al.Privacy-aware visual language models[J].arXiv:2405.17423,2024.
[10]BAI Z,WANG P,XIAO T,et al.Hallucination of multimodallLarge language models:a survey[J].arXiv:2404.18930,2024.
[11]LIU D,YANG M,QU X,et al.A survey of attacks on large vision-language models:resources,advances,and future trends[J].arXiv:2407.07403,2024.
[12]YAO Y,DUAN J,XU K,et al.A survey on large language mo-del (llm) security and privacy:the good,the bad,and the ugly[J].High-Confidence Computing,2024,4(2):100211.
[13]DONG Z,ZHOU Z,YANG C,et al.Attacks,defenses and eva-luations for llm conversation safety:a Survey[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2024:6734-6747.
[14]SUN H,ZHANG Z,DENG J,et al.Safety assessment of chinese large language models[J].arXiv:2304.10436,2023.
[15]TU H,CUI C,WANG Z,et al.How many unicorns are in this image? a safety evaluation benchmark for vision llms[J].arXiv:2311.16101,2023.
[16]LIU X,ZHU Y,LAN Y,et al.Safety of multimodal large language models on images and texts[C]//Proceedings of the Thirty Third International Joint Conference on Artificial Intelligence.2024:8151-8159.
[17]HE K,ZHANG X,REN S,et al.Deep residual learning forimage recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[18]TAN M,LE Q V.Efficientnet:rethinking model scaling for convolutional neural networks[C]//International Conference on Machine Learning.PMLR,2019:6105-6114.
[19]SMITH S L,BROCK A,BERRADA L,et al.Convnets match vision transformers at scale[J].arXiv:2310.16764,2023.
[20]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.Animage is worth 16×16 words:transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[21]LIU Z,LIN Y,CAO Y,et al.Swin transformer:hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:10012-10022.
[22]FANG Y,WANG W,XIE B,et al.Eva:exploring the limits of masked visual representation learning at scale[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:19358-19369.
[23]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.
[24]SUN Q,FANG Y,WU L,et al.EVA-CLIP:improved training techniques for clip at scale[J].arXiv:2303.15389,2023.
[25]ZHU D,CHEN J,SHEN X,et al.MiniGPT-4:enhancing vision-language understanding with advanced large language models[J].arXiv:2304.10592,2023.
[26]CAI Y,LIU Y,ZHANG Z,et al.CLAP:isolating content from style through contrastive learning with augmented prompts[C]//European Conference on Computer Vision.2024:130-147.
[27]RADFORD A,KIM J W,XU T,et al.Robust speech recognition via large-scale weak supervision[C]//International Conference on Machine Learning.PMLR,2023:28492-28518.
[28]HSU W N,BOLTE B,TSAI Y H H,et al.Hubert:self-supervised speech representation learning by masked prediction of hidden units[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2021,29:3451-3460.
[29]ARNAB A,DEHGHANI M,HEIGOLD G,et al.Vivit:a video vision transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:6836-6846.
[30]ZHAO L,GUNDAVARAPU N B,YUAN L,et al.Videoprism:a foundational visual encoder for video understanding[C]//Proceedings of the 41st International Conference on Machine Lear- ning.PMLR,2024:60785-60811.
[31]FEDUS W,ZOPH B,SHAZEER N.Switch transformers:Sca-ling to trillion parameter models with simple and efficient sparsity[J].Journal of Machine Learning Research,2022,23(120):1-39.
[32]BROWN T,MANN B,RYDER N,et al.Language models arefew-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems.2020:1877-1901.
[33]CHUNG H W,HOU L,LONGPRE S,et al.Scaling instruction-finetuned language models[J].Journal of Machine Learning Research,2024,25(70):1-53.
[34]TOUVRON H,MARTIN L,STONE K,et al.Llama 2:Openfoundation and fine-tuned chat models[J].arXiv:2307.09288,2023.
[35]MA C,ZHANG Y,SHEN S,et al.Vicuna:An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality[EB/OL].(2023-03-30)[2025-05-08].https://lmsys.org/blog/2023-03-30-vicuna/.
[36]BAI J,BAI S,CHU Y,et al.Qwen technical report[J].arXiv:2309.16609,2023.
[37]LI J,LI D,SAVARESE S,et al.Blip-2:Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International Conference on Machine Lear- ning.PMLR,2023:19730-19742.
[38]JIAN Y,GAO C,VOSOUGHI S.Bootstrapping vision-language learning with decoupled language pre-training[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:57-72.
[39]YE Q,XU H,XU G,et al.mplug-owl:Modularization empowers large language models with multimodality[J].arXiv:2304.14178,2023.
[40]BAI J,BAI S,YANG S,et al.Qwen-VL:a versatile vision-language model for understanding,localization,text reading,and beyond[J].arXiv:2308.12966,2023.
[41]LU J,GAN R,ZHANG D,et al.Lyrics:Boosting fine-grained language-vision alignment and comprehension via semantic- aware visual objects[J].arXiv:2312.05278,2023.
[42]LIU H,LI C,WU Q,et al.Visual instruction tuning[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:34892-34916.
[43]LI C,WONG C,ZHANG S,et al.Llava-med:Training a largelanguage-and-vision assistant for biomedicine in one day[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:28541-28564.
[44]SU Y,LAN T,LI H,et al.PandaGPT:One Model To Instruction-Follow Them All[C]//Proceedings of the 1st Workshop on Taming Large Language Models:Controllability in the era of Interactive Assistants.2023:11-23.
[45]YE Q,XU H,XU G,et al.mplug-owl:Modularization empowers large language models with multimodality[J].arXiv:2304.14178,2023.
[46]SHARMA P,DING N,GOODMAN S,et al.Conceptual captions:A cleaned,hypernymed,image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:2556-2565.
[47]CHANGPINYO S,SHARMA P,DING N,et al.Conceptual 12m:Pushing web-scale image-text pre-training to recognize long-tail visual concepts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:3558-3568.
[48]SCHUHMANN C,BEAUMONT R,VENCU R,et al.Laion-5b:an open large-scale dataset for training next generation image-text models[C]//Proceedings of the 36h International Conference on Neural Information Processing Systems.2022,25278-5294.
[49]CHEN L,LI J,DONG X,et al.Sharegpt4v:Improving largemulti-modal models with better captions[C]//European Confe- rence on Computer Vision.Cham:Springer,2025:370-387.
[50]CHEN G H,CHEN S,ZHANG R,et al.Allava:Harnessing gpt4v-synthesized data for a lite vision-language model[J].ar- Xiv:2402.11684,2024.
[51]BROWN T,MANN B,RYDER N,et al.Language models arefew-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems.2020:1877-1901.
[52]DAI W L,LI J N,L D X,et al.InstructBLIP:towards general-purpose vision-language models with instruction tuning[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:49250-49267.
[53]CHEN F,HAN M,ZHAO H,et al.X-llm:Bootstrapping ad-vanced large language models by treating multi-modalities as foreign languages[J].arXiv:2305.04160,2023.
[54]ZHANG R,HAN J,LIU C,et al.Llama-adapter:Efficient fine-tuning of language models with zero-init attention[J].arXiv:2303.16199,2023.
[55]WANG W,CHEN Z,CHEN X,et al.Visionllm:Large language model is also an open-ended decoder for vision-centric tasks[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:61501-61513.
[56]ZHAO Z,GUO L,YUE T,et al.Chatbridge:Bridging modalities with large language model as a language catalyst[J].arXiv:2305.16103,2023.
[57]LI L,YIN Y,LI S,et al.M3IT:A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[J].arXiv:2306.04387,2023.
[58]GAO P,HAN J,ZHANG R,et al.Llama-adapter v2:Parameter-efficient visual instruction model[J]. arXiv:2304.15010,2023.
[59]WANG Y,KORDI Y,MISHRA S,et al.Self-instruct:Aligning language models with self-generated instructions[J]. arXiv:2212.10560,2022.
[60]LUO G,ZHOU Y,REN T,et al.Cheap and quick:Efficient vision-language instruction tuning for large language models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:29615-29627.
[61]XU Z,SHEN Y,HUANG L.Multiinstruct:Improving multi-modal zero-shot learning via instruction tuning[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics .2023:11445-11465.
[62]ZENG Y,ZHANG H,ZHENG J,et al.What Matters in Trai-ning a GPT4-Style Language Model with Multimodal Inputs?[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2024:7930-7957.
[63]OUYANG L,WU J,JIANG X,et al.Training language models to follow instructions with human feedback[[C]//Proceedings of the 36th International Conference on Neural Information Processing Systems.2023:27730-27744.
[64]RAFAILOV R,SHARMA A,MITCHELL E,et al.Direct pre-ference optimization:Your language model is secretly a reward mode[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:53728-53741.
[65]LI L,XIE Z,LI M,et al.Silkie:Preference distillation for large visual language models[J].arXiv:2312.10665,2023.
[66] YU T,YAO Y,ZHANG H,et al.RLHF-V:towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2024:13807-13816.
[67]ZHU D,CHEN J,SHEN X,et al.Minigpt-4:Enhancing vision-language understanding with advanced large language models[J].arXiv:2304.10592,2023.
[68] LI K C,HE Y,WANG Y,et al.Videochat:Chat-centric video understanding[J].arXiv:2305.06355,2023.
[69]CHU Y,XU J,ZHOU X,et al.Qwen-audio:Advancing universal audio understanding via unified large-scale audio-language models[J].arXiv:2311.07919,2023.
[70]WU C,YIN S,QI W,et al.Visual chatgpt:Talking,drawing and editing with visual foundation models[J].arXiv:2303.04671,2023.
[71]SHEN Y,SONG K,TAN X,et al.Hugginggpt:Solving ai tasks with chatgpt and its friends in hugging face[J].Advances in Neural Information Processing Systems,2024,36.
[72]HUANG R,LI M,YANG D,et al.Audiogpt:Understanding and generating speech,music,sound,and talking head[C]// Procee- dings of the AAAI Conference on Artificial Intelligence.2024:23802-23804.
[73]WU S,FEI H,QU L,et al.Next-gpt:any-to-any multimodal LLM[J].arXiv:2309.05519,2023.
[74]TANG Z,YANG Z,KHADEMI M,et al.CoDi-2:In-Context Interleaved and Interactive Any-to-Any Generation[C]// Procee- dings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27425-27434.
[75] LIU Y,DENG G,LI Y,et al.Prompt Injection attack against LLM-integrated Applications[J].arXiv:2306.05499,2023.
[76]DAI S,XU C,XU S,et al.Bias and unfairness in information retrieval systems:New challenges in the llm era[C]//Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.2024:6437-6447.
[77]YAO Y,DUAN J,XU K,et al.A survey on large language mo- del (llm) security and privacy:The good,the bad,and the ugly[J].High-Confidence Computing,2024,4(2):100211.
[78]DONG Z,ZHOU Z,YANG C,et al.Attacks,defenses and eva-luations for llm conversation safety:A survey[J].arXiv:2402.09283,2024.
[79]PI R,HAN T,ZHANG J,et al.MLLM-Protector:EnsuringMLLM’s Safety without Hurting Performance[J].arXiv:2401.02906,2024.
[80]GONG Y,RAN D,LIU J,et al.FigStep:jailbreaking large vision-language models via typographic visual prompts[J].arXiv:2311.05608,2023.
[81]MENDES E,CHEN Y,HAYS J,et al.Granular privacy control for geolocation with vision languagemodels[J].arXiv:2407.04952,2024.
[82]YANG Z,WEI Y,LIANG C,et al.Quantifying and enhancingmulti-modal robustness with modality preference[C]//International Conference on Learning Representations.2024:1-23.
[83]BIRHANE A,PRABHU V,HAN S,et al.Into the laions den:investigating hate in multimodal datasets[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:21268-21284.
[84] YANG Z,HE X,LI Z,et al.Data poisoning attacks against multimodal encoders[C]//International Conference on Machine Learning.PMLR,2023:39299-39313.
[85]ZHANG Y,HUANG Y,SUN Y,et al.Benchmarking trustworthiness of multimodal large language models:a comprehensive study[J].arXiv:2406.07057,2024.
[86]ZHANG H,SHAO W,LIU H,et al.AVIBench:towards evaluating the robustness of large vision-language model on adversa- rial visual-instructions[J].arXiv:2403.09346,2024.
[87]WANG S,YE X,CHENG Q,et al.Cross-modality safety alignment[J].arXiv:2406.15279,2024.
[88]WANG P,ZHANG D,LI L,et al.Inferaligner:inference-timealignment for harmlessness through cross-model gidance[J]. arXiv:2401.11206,2024.
[89]QI X,ZENG Y,XIE T,et al.Fine-tuning Aligned LanguageModels Compromises Safety,Even When Users do not intend to![J].arXiv:2310.03693,2023.
[90]LENG S,XING Y,CHENG Z,et al.The curse of multi-modalities:Evaluating hallucinations of large multimodal models across language,visual,and audio[J].arXiv:2410.12787,2024.
[91]XU Y,YAO J,SHU M,et al.Shadowcast:stealthy data poisoning attacks against vision-language models[J].arXiv:2402.06659,2024.
[92]TAO X,ZHONG S,LI L,et al.Imgtrojan:jailbreaking vision-language models with one image[J].arXiv:2403.02910,2024.
[93]LI Y,DU Y,ZHOU K,et al.Evaluating object hallucination in large vision-language models[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:292-305.
[94] ZOU X,YANG J,ZHANG H,et al.Segment everything everywhere all at once[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:9769-9782.
[95]LOVENIA H,DAI W,CAHYAWIJAYA S,et al.Negative object presence evaluation (nope) to measure object hallucination in vision-language models[C]//Proceedings of the 3rd Workshop on Advances in Language and Vision Research.2024:37-58.
[96]HU H,ZHANG J,ZHAO M,et al.CIEM:contrastive instruction evaluation method for better instruction tuning[J].arXiv:2309.02301,2023.
[97]JIANG C,YE W,DONG M,et al.Hal-Eval:a universal and fine-grained hallucination evaluation framework for large vision language models[C]//Proceedings of the 32nd ACM International Conference on Multimedia.2024:525-534.
[98]HUANG W,LIU H,GUO M,et al.Visual hallucinations ofmulti-modal large language models[C]//Proceedings of the Association for Computational Linguistics.2024:9614-9631.
[99]KAUL P,LI Z,YANG H,et al.THRONE:An object-based hallucination benchmark for the free-form generations of large vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27228-27238.
[100]CHANDU K R,LI L,AWADALLA A,et al.Certainly Uncertain:A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness[J].arXiv:2407.01942,2024.
[101]CHEN X,WANG C,XUE Y,et al.Unified hallucination detection for multimodal large language models[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.2024:3235-3252.
[102] GUNJAL A,YIN J,BAS E.Detecting and preventing hallucinations in large vision language models [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:18135-18143.
[103]CHEN Z,ZHU Y,ZHAN Y,et al.Mitigating hallucination invisual language models with visual supervision[J].arXiv:2311.16479,2023.
[104]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:common objects in context[C]//European Conference on Computer Vision.2014:740-755.
[105]FU C,CHEN P,SHEN Y,et al.MME:a comprehensive evaluation benchmark for multimodal large language models[J].ar- Xiv:2306.13394,2023.
[107]VILLA A,ALCÁZAR J C L,SOTO A,et al.Behind the magic,merlim:multi-modal evaluation benchmark for large image-language models[J].arXiv:2312.02219,2023.
[108]ROHRBACH A,HENDRICKS L A,BURNS K,et al.Object hallucination in image captioning[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proces- sing.2018:4035-4045.
[109]BEN-KISH A,YANUKA M,ALERP M,et al.MOCHa:multi-objective reinforcement mitigating caption hallucinations[J].arXiv:2312.03631,2023.
[110]LIU F,LIN K,LI L,et al.Mitigating hallucination in largemulti-modal models via robust instruction tuning[C]//The Twelfth International Conference on Learning Representations.2023:1-45.
[111]JING L,LI R,CHEN Y,et al.FaithScore:fine-grained evaluations of hallucinations in large vision-language models[J]. ar- Xiv:2311.01477,2023.
[112]WANG J,ZHOU Y,XU G,et al.Evaluation and analysis of hallucination in large vision-language models[J].arXiv:2308.15126,2023.
[113] SUN Z,SHEN S,CAO S,et al.Aligning large multimodal mo-dels with factually augmented RLHF[C]//Proceedings of the Association for Computational Linguistics.2024:13088-13110.
[114]GUAN T,LIU F,WU X,et al.Hallusion Bench:anadvanceddiagnostic suite for entangled language hallucination and visual illusion in large vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:14375-14385.
[115]WANG J,WANG Y,XU G,et al.AMBER:an llm-free multi-dimensional benchmark for mllms hallucination evaluation[J].arXiv:2311.07397,2024.
[116]CALDARELLA S,MANCINI M,RICCI E,et al.The phantom menace:unmasking privacy leakages in vision-language models[J].arXiv:2408.01228,2024.
[117]CHEN Y,MENDES E,DAS S,et al.Can language models be instructed to protect personal information?[J].arXiv:2310.02224,2023.
[118]GU T,ZHOU Z,HUANG K,et al.MLLMGuard:a multi-dimensional safety evaluation suite for mult-imodal large language models[J].arXiv:2406.07594,2024.
[119]WANG S,YE X,CHENG Q,et al.Cross-modality safety alignment[J].arXiv:2406.151279,2024.
[120]XIA P,CHEN Z,TIAN J,et al.CARES:a comprehensivebenchmark of trustworthiness in medical vision language models[J].arXiv:2406.06007,2024.
[121]CAPITANI G,LUCARINI A,BONICELLI L,et al.Beyond the surface:comprehensive analysis of implicit bias in vision-language models[C]//Intervento Presentato Al Convegno Fairness and Ethics towards Transparent AI:Face the Challenge through Model Debiasing.2024.
[122]SESHADRI P,SINGH S,ELAZAR Y.The bias amplification paradox in text-to-image generation[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2024:6367-6384.
[123]WAN Y,CHANG K W.The male ceo and the female assistant:gender biases in text-to-image generation of dual subjects[J].arXiv:2402.11089,2024.
[124]ZHONG Y,BAGHEL B K.Multimodal understanding of memes with fair explanations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:2007-2017.
[125]CAI R,SONG Z,GUAN D,et al.BenchLMM:ben-chmarking cross-style visual capability of large multimodal models[J]. ar- Xiv:2312.02896,2023.
[126]CUI P,WANG J.Out-of-distribution (OOD) detection based on deep learning:a review[J].Electronics,2022,11(21):3500.
[127]ZHAO Y,PANG T,DU C,et al.On Evaluating Adversarial robustness of large vision-language models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:54111-54138
[128]KHATTAK M U,NAEEM M F,HASSAN J, et al.How good is my video lmm? complex video reasoning and robustness evaluation suite for video-lmms[J].arXiv:2405.03690,2024.
[129]ZHOU K,LIU C,ZHAO X,et al.Multimodal Situational Safety[J].arXiv:2410.06172,2024.
[130]WANG Y,TENG Y,HUANG K,et al.Fake Alignment:Are LLMs Really Aligned Well?[J].arXiv:2311.05915,2023.
[131] YOU H,ZHANG H,GAN Z,et al.Ferret:refer and groundanything anywhere at any granularity[J].arXiv:2310.07704,2023.
[132] WANG L,HE J,LI S,et al.Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites[C]//International Conference on Multimedia Modeling.Cham:Springer,2024:32-45.
[133] CHEN Z,WU J,WANG W,et al.Internvl:scaling up vision foundation models and aligning for generic visual-linguistic tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:24185-24198.
[134] ZHAI B,YANG S,ZHAO X,et al.HallE-Switch:rethinkingand controlling object existence hallucinations in large vision language models for detailed caption[J].arXiv:2310.01779,2023.
[135] LI Z,YANG B,LIU Q,et al.Monkey:image resolution and text label are important things for large multi-modal models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:26763-26773.
[136] HE X,WEI L,XIE L,et al.Incorporating visual experts to resolve the information loss in multimodal large language models[J].arXiv:2401.03105,2024.
[137] JAIN J,YANG J,SHI H.Vcoder:versatile vision encoders for multimodal large language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27992-28002.
[138] ZHAO Y,LI Z,JIN Z,et al.Enhancing the spatial awareness capability of multi-modal large language model[J].arXiv:2310.20357,2023.
[139] ZHAO Z,WANG B,OUYANG L,et al.Beyond hallucinations:enhancing lvlms through hallucination-aware direct preference optimization[J].arXiv:2311.16839,2023.
[140] SUN Z,SHEN S,CAO S,et al.Aligning large multimodal mo-dels with factually augmented rlhf[C]//Proceedings of the Association for Computational Linguistics.2024:13088-13110.
[141] JIANG C,XU H,DONG M,et al.Hallucination augmented contrastive learning for multimodal large language model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27036-27046.
[142] CHEN Z,WU J,WANG W,et al.Internvl:scaling up vision foundation models and aligning for generic visual-linguistic tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:24185-24198.
[143] GUNJAL A,YIN J,BAS E.Detecting and preventing hallucinations in large vision language models [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:18135-18143.
[144] LI L,XIE Z,LI M,et al.Silkie:preference distillation for large visual language models[J].arXiv:2312.10665,2023.
[145] ZHOU Y,CUI C,RAFAILOV R,et al.Aligning modalities in vision large language models via preference fine-tuning [J]. ar- Xiv:2402.11411,2024.
[146] DWORK C,ROTH A.The algorithmic foundations of differential privacy[J].Foundations and Trends in Theoretical Compu- ter Science,2014,9(3/4):211-407.[147] LIU Q,HUANG Y,JIN C,et al.Privacy and integrity protection for iot multimodal data using machine learning and blockchain[J].ACM Transactions on Multimedia Computing,Communications,an-d Applications,2024,20(6):1-18.
[148] WANG L,SANG L,ZHANG Q,et al.A privacy-preserving framework with multi-modal data for cross-domain recommendation[J].Knowledge-Based Systems,2024,304:112529.
[149] SAMSON L,BARAZANI N,GHEBREAB S,et al.Privacy-aware visual language models[J].arXiv:2405.17423,2024.
[150] YIN L,LIN S,SUN Z,et al.PriMonitor:an adaptive tuning privacy-preserving approach for multimodal emotion detection[J].World Wide Web,2024,27(2):9.
[151] CAO D,WU J,BASHIR A K.Multimodal large language mo-dels driven privacy-preserving wireless semantic communication in 6g[C]//IEEE International Conference on Communications Workshops.2024:171-176.
[152] ALABDULMOHSIN I,WANG X,STEINER A,et al.CLIP the bias:how useful is balancing data in multimodal learning? [C]//International Conference on Learning Representations.2024:1-32.
[153] CHENG H,GUO Y,GUO Q,et al.Social debiasing for fair multi-modal llms[J].arXiv:2408.06569,2024.
[154] BERG H,HALL S M,BHALGAT Y,et al.A prompt array keeps the bias away:debiasing vision-language models with adversarial learning[C]//Proceedings of the 2nd Conference of the Asia-Pacif-ic Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing.2022:806-822.
[155] JIANG J,MANORANJAN V,SALAM H,et al.Towards generalised and incremental bias mitigation in personality computing[J].IEEE Transactions on Affective Computing,2024,15(4):2192-2203.
[156]WANG Z,LI X,QIN Z,et al.Can We Debias Multimodal Large Language Models via Model Editing?[C]//Proceedings of the 32nd ACM International Conference on Multimedia.2024:3219-3228.
[157]BRINKMANN J,SWOBODA P,BARTELT C.A multidimen-sional analysis of social biases in vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:4914-4923.
[158]CARLINI N,NASR M,CHOQUETTE-CHOO C A,et al.Are aligned neural networks adversarially al-igned? [C]// Procee- dings of the 37th International Conference on Neural Information Processing Systems.2023:61478-61500.
[159]SHAYEGANI E,DONG Y,ABU-GHAZALEH N.Jailbreak in pieces:compositional adversarial attacks on multi-modal language models[J].arXiv:2307.14539,2023.
[160]WANG R,MA X,ZHOU H,et al.White-box multimodal jailbreaks against large vision-language models[J].arXiv:2405.17894,2024.
[161]YIN Z,YE M,ZHANG T,et al.VLATTACK:multimodal adversarial attacks on vision-language tasks via pre-trained models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:52936-52956.
[162]QI X,HUANG K,PANDA A,et al.Visual adversarial examples jailbreak aligned large language models[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:21527-21536.
[163]NIU Z,REN H,GAO X,et al.Jailbreaking attack against multimodal large language model[J].arXiv:2402.02309,2024.
[164]LUO H,GU J,LIU F,et al.An image is worth 1000 lies:adversarial transferability across prompts on vision-language models[J].arXiv:2403.09766,2024.
[165]GU X,ZHENG X,PANG T,et al.Agent smith:a single image can jailbreak one million multimodal llm agents exponentially fast[J].arXiv:2402.08567,2024.
[166]MA S,LUO W,WANG Y,et al.Visual-RolePlay:universal jailbreak attack on multimodal large language models via role-playing image character[J].arXiv:2405.20773,2024.
[167]LI Y,GUO H,ZHOU K,et al.Images are achilles’ heel ofalignment:exploiting visual vulnerabilities for jailbreaking multimodal large language models[J].arXiv:2403.09792,2024.
[168]LIU X,ZHU Y,GU J,et al.MM-SafetyBench:a benchmark for safety evaluation of multimodal large language models[J]. ar- Xiv:2311.17600,2023.
[169]Madry A.Towards deep learning models resistant to adversarial attacks[J].arXiv:1706.06083,2017.
[170]CROCE F,HEIN M.Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks[C]//International Conference on Machine Learning.PMLR,2020:2206-2216.
[171]CARLINI N,WAGNER D.Towards evaluating the robustness of neural networks[C]//2017 IEEE Symposium on Security and Privacy.2017:39-57.
[172] CUI X,APARCEDO A,JANG Y K,et al.On the robustness of large multimodal models against image adversarial attacks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:24625-24634.
[173]LUO H,GU J,LIU F,et al.An image is worth 1000 lies:adversarial transferability across prompts on vision-language models[J].arXiv:2403.09766,2024.
[174]GAO K,BAI Y,GU J,et al.Inducing high energy-latency of large vision-language models with verbose images[J].arXiv:2401.11170,2024.
[175]FU X,WANG Z,LI S,et al.Misusing tools in large language models with visual adversarial examples[J].arXiv:2310.03185,2023.
[176]WU X,CHAKRABORTY S,XIAN R,et al.Highlighting the safety concerns of deploying llms/vlms in robotics[J].arXiv:2402.10340,2024.
[177]DONG Y,CHEN H,CHEN J,et al.How robust is google’s bard to adversarial image attacks?[J].arXiv:2309.11751,2023.
[178]WANG X,JI Z,MA P,et al.InstructTA:instruction-tuned targeted attack for large vision-language models[J].arXiv:2312.01886,2023.
[179]CHENG S,MIAO Y,DONG Y,et al.Efficient black-box adversarial attacks via bayesian optimization guided by a function prior [C]//Proceedings of the 41st International Conference on Machine Learning.PMLR,2024:8163-8183.
[180]FRAZIER P I.A tutorial on Bayesian optimization[J].arXiv:1807.02811,2018.
[181]CARLINI N,TERZIS A.Poisoning and backdooring contrastive learning[J].arXiv:2106.09667,2021.
[182]LIANG J,LIANG S,LUO M,et al.VL-Trojan:multimodal instruction backdoor attacks against autoregressive visual language models[J].arXiv:2402.13851,2024.
[183]NI Z,YE R,WEI Y,et al.Physical backdoor attack can jeopardize driving with vision-large-language models[J].arXiv:2404.12916,2024.
[184]LU D,PANG T,DU C,et al.Test-time backdoor attacks on multimodal large language models[J].arXiv:2402.08577,2024.
[185]LIANG S,LIANG J,PANG T,et al.Revisiting backdoor attacks against large vision-language models[J].arXiv:2406.18844,2024.
[186]CHEN C,HUANG B,LI Z,et al.Can editing llms inject harm?[J].arXiv:2407.20224,2024.
[187]CHENG S Y,TIAN B Z,LIU Q B,et al.Can we edit multimodal large language models?[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:13877-13888.
[188]WANG Y,LIU X,LI Y,et al.AdaShield:safeguarding multimodal large language models from structure-based attack via adaptive shield prompting[C]//European Conference on Computer Vision (ECCV).2024:1-25.
[189]ZHANG X,ZHANG C,LI T,et al.Jailguard:A universal detection framework for llm prompt-based attacks[J].arXiv:2312.10766,2024.
[190]PI R,HAN T,ZHANG J,et al.MLLM-Protector:ensuring mllm’s safety without hurting performance[J].arXiv:2401.02906,2024.
[191]CHEN Y,SIKKA K,COGSWELL M,et al.Dress:instructing large vision-language models to align and interact with humans via natural language feedback[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:14239-14250.
[192]DENG J,DONG W,SOCHER R,et al.ImageNet:A Large-Scale Hierarchical Image Database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition.2009:248-255.
[193]PLUMMER B A,WANG L,CERVANTES C M,et al.Flickr30k entities:Collecting region-to-phrase correspondences for richer image-to-sentence models[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2641-2649.
[194]GURARI D,LI Q,STANGL A J,et al.Vizwiz grand challenge:Answering visual questions from blind people[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:3608-3617.
[195]GOYAL Y,KHOT T,SUMMERS-STAY D,et al.Making the v in vqa matter:Elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:6904-6913.
[196]MARINO K,RASTEGARI M,FARHADI A,et al.Ok-vqa:A visual question answering benchmark requiring external knowledge[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:3195-3204.
[197]LIN H,LUO Z,WANG B,et al.Goat-bench:Safety insights to large multimodal models through meme-based social abuse[J].arXiv:2401.01523,2024.
[198]WANG X,YI X,JIANG H,et al.ToViLaG:your visual-language generative model is also an evildoer[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:3508-3533.
[199]YING Z,LIU A,LIANG S,et al.SafeBench:a safety evaluation framework for multimodal large language models[J].arXiv:2410.18927,2024.
[200]LIU X,ZHU Y,GU J,et al.MM-SafetyBench:a benchmark for safety evaluation of multimodal large language models[J].arXiv:2311.17600,2023.
[201]LI M,LI L,YIN Y,et al.Red teaming visual language models[J].arXiv:2401.12915,2024.
[202]WU Y,LI X,LIU Y,et al.Jailbreaking gpt-4v via self-adversarial attacks with system prompts[J].arXiv:2311.09127,2023.
[203]BAILEY L,ONG E,RUSSELL S,et al.Image hijacks:Adversarial images can control generative models at runtime[J].arXiv:2309.00236,2023.
[204]VAN M H,WU X.Detecting and correcting hate speech in multimodal memes with large visual language model[J].arXiv:2311.06737,2023.
[205]SCHLARMANN C,HEIN M.On the adversarial robustness of multi-modal foundation models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:3677-3685.
[206]JI Y,GE C,KONG W,et al.Large language models as automated aligners for benchmarking vision-language models[J].arXiv:2311.14580,2023.
[207]GUO Q,PANG S,JIA X,et al.Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models[J].arXiv:2404.10335,2024.
[208]FAN Y,CAO Y,ZHAO Z,et al.Unbridled icarus:a survey of the potential perils of image inputs in multimodal large language model security[J].arXiv:2407.12784,2024.
[209]ZHAO S,YANG Y,WANG Z,et al.Retrieval augmented generation (rag) and beyond:a comprehensive survey on how to make your llms use external data more wisely[J].arXiv:2409.14924,2024.
[210]ZHANG B,TAN Y,SHEN Y,et al.Breaking agents:compromising autonomous llm agents through malfunction amplification[J].arXiv:2407.20859,2024.
[211]WANG Y,XUE D,ZHANG S,et al.BadAgent:inserting and activating backdoor attacks in llm agents[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.2024:9811-9827.
[212]CHEN Z,XIANG Z,XIAO C,et al.Agentpoison:red-teaming llm agents via poisoning memory or knowledge bases[J].arXiv:2407.12784,2024.
[213]HINTERSDORF D,STRUPPEK L,BRACK M,et al.Does clip know my face?[J].Journal of Artificial Intelligence Research,2024,80:1033-1062.