Computer Science ›› 2025, Vol. 52 ›› Issue (7): 315-341.doi: 10.11896/jsjkx.241100141
• Information Security • Previous Articles Next Articles
CHEN Jinyin1,2, XI Changkun1, ZHENG Haibin1,2,3, GAO Ming1, ZHANG Tianxin1
CLC Number:
[1]JI J,QIU T,CHEN B,et al.Ai alignment:a comprehensive survey[J].arXiv:2310.19852,2023. [2]YUAN J,SUN S,OMEIZA D,et al.Rag-driver:generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model[J].arXiv:2402.10828,2024. [3]ZHANG Z,ZHANG A,LI M,et al.Multimodal chain-of-thought reasoning in language models[J].arXiv:2302.00923,2023. [4]SREERAM S,WANG T H,MAALOUF A,et al.Probing multimodal llms as world models for driving[J].arXiv:2405.05956,2024. [5]ZHANG X,WU C,ZHAO Z,et al.PMC-VQA: visual instruction tuning for medical visual question answering[J].arXiv:2305.10415,2024. [6]RAHMAN Md A,ALQAHTANI L,ALBOOQ A,et al.A survey on security and privacy of large multimodal deep learning models:teaching and learning perspective[C]//2024 21st Learning and Technology Conference (L&T).2024:13-18. [7]AVSEC Ž,AGARWAL V,VISENTIN D,et al.Effective gene expression prediction from sequence by integrating long-range interactions[J].Nature Methods,2021,18(10):1196-1203. [8]CABELLO L,BUGLIARELLO E,BRANDL S,et al.Evaluating bias and fairness in gender-neutral pretrained vision-and-language models[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:8465-8483. [9]SAMSON L,BARAZANI N,GHEBREAB S,et al.Privacy-aware visual language models[J].arXiv:2405.17423,2024. [10]BAI Z,WANG P,XIAO T,et al.Hallucination of multimodallLarge language models:a survey[J].arXiv:2404.18930,2024. [11]LIU D,YANG M,QU X,et al.A survey of attacks on large vision-language models:resources,advances,and future trends[J].arXiv:2407.07403,2024. [12]YAO Y,DUAN J,XU K,et al.A survey on large language mo-del (llm) security and privacy:the good,the bad,and the ugly[J].High-Confidence Computing,2024,4(2):100211. [13]DONG Z,ZHOU Z,YANG C,et al.Attacks,defenses and eva-luations for llm conversation safety:a Survey[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2024:6734-6747. [14]SUN H,ZHANG Z,DENG J,et al.Safety assessment of chinese large language models[J].arXiv:2304.10436,2023. [15]TU H,CUI C,WANG Z,et al.How many unicorns are in this image? a safety evaluation benchmark for vision llms[J].arXiv:2311.16101,2023. [16]LIU X,ZHU Y,LAN Y,et al.Safety of multimodal large language models on images and texts[C]//Proceedings of the Thirty Third International Joint Conference on Artificial Intelligence.2024:8151-8159. [17]HE K,ZHANG X,REN S,et al.Deep residual learning forimage recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778. [18]TAN M,LE Q V.Efficientnet:rethinking model scaling for convolutional neural networks[C]//International Conference on Machine Learning.PMLR,2019:6105-6114. [19]SMITH S L,BROCK A,BERRADA L,et al.Convnets match vision transformers at scale[J].arXiv:2310.16764,2023. [20]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.Animage is worth 16×16 words:transformers for image recognition at scale[J].arXiv:2010.11929,2020. [21]LIU Z,LIN Y,CAO Y,et al.Swin transformer:hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:10012-10022. [22]FANG Y,WANG W,XIE B,et al.Eva:exploring the limits of masked visual representation learning at scale[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:19358-19369. [23]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763. [24]SUN Q,FANG Y,WU L,et al.EVA-CLIP:improved training techniques for clip at scale[J].arXiv:2303.15389,2023. [25]ZHU D,CHEN J,SHEN X,et al.MiniGPT-4:enhancing vision-language understanding with advanced large language models[J].arXiv:2304.10592,2023. [26]CAI Y,LIU Y,ZHANG Z,et al.CLAP:isolating content from style through contrastive learning with augmented prompts[C]//European Conference on Computer Vision.2024:130-147. [27]RADFORD A,KIM J W,XU T,et al.Robust speech recognition via large-scale weak supervision[C]//International Conference on Machine Learning.PMLR,2023:28492-28518. [28]HSU W N,BOLTE B,TSAI Y H H,et al.Hubert:self-supervised speech representation learning by masked prediction of hidden units[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2021,29:3451-3460. [29]ARNAB A,DEHGHANI M,HEIGOLD G,et al.Vivit:a video vision transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:6836-6846. [30]ZHAO L,GUNDAVARAPU N B,YUAN L,et al.Videoprism:a foundational visual encoder for video understanding[C]//Proceedings of the 41st International Conference on Machine Lear- ning.PMLR,2024:60785-60811. [31]FEDUS W,ZOPH B,SHAZEER N.Switch transformers:Sca-ling to trillion parameter models with simple and efficient sparsity[J].Journal of Machine Learning Research,2022,23(120):1-39. [32]BROWN T,MANN B,RYDER N,et al.Language models arefew-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems.2020:1877-1901. [33]CHUNG H W,HOU L,LONGPRE S,et al.Scaling instruction-finetuned language models[J].Journal of Machine Learning Research,2024,25(70):1-53. [34]TOUVRON H,MARTIN L,STONE K,et al.Llama 2:Openfoundation and fine-tuned chat models[J].arXiv:2307.09288,2023. [35]MA C,ZHANG Y,SHEN S,et al.Vicuna:An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality[EB/OL].(2023-03-30)[2025-05-08].https://lmsys.org/blog/2023-03-30-vicuna/. [36]BAI J,BAI S,CHU Y,et al.Qwen technical report[J].arXiv:2309.16609,2023. [37]LI J,LI D,SAVARESE S,et al.Blip-2:Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International Conference on Machine Lear- ning.PMLR,2023:19730-19742. [38]JIAN Y,GAO C,VOSOUGHI S.Bootstrapping vision-language learning with decoupled language pre-training[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:57-72. [39]YE Q,XU H,XU G,et al.mplug-owl:Modularization empowers large language models with multimodality[J].arXiv:2304.14178,2023. [40]BAI J,BAI S,YANG S,et al.Qwen-VL:a versatile vision-language model for understanding,localization,text reading,and beyond[J].arXiv:2308.12966,2023. [41]LU J,GAN R,ZHANG D,et al.Lyrics:Boosting fine-grained language-vision alignment and comprehension via semantic- aware visual objects[J].arXiv:2312.05278,2023. [42]LIU H,LI C,WU Q,et al.Visual instruction tuning[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:34892-34916. [43]LI C,WONG C,ZHANG S,et al.Llava-med:Training a largelanguage-and-vision assistant for biomedicine in one day[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:28541-28564. [44]SU Y,LAN T,LI H,et al.PandaGPT:One Model To Instruction-Follow Them All[C]//Proceedings of the 1st Workshop on Taming Large Language Models:Controllability in the era of Interactive Assistants.2023:11-23. [45]YE Q,XU H,XU G,et al.mplug-owl:Modularization empowers large language models with multimodality[J].arXiv:2304.14178,2023. [46]SHARMA P,DING N,GOODMAN S,et al.Conceptual captions:A cleaned,hypernymed,image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:2556-2565. [47]CHANGPINYO S,SHARMA P,DING N,et al.Conceptual 12m:Pushing web-scale image-text pre-training to recognize long-tail visual concepts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:3558-3568. [48]SCHUHMANN C,BEAUMONT R,VENCU R,et al.Laion-5b:an open large-scale dataset for training next generation image-text models[C]//Proceedings of the 36h International Conference on Neural Information Processing Systems.2022,25278-5294. [49]CHEN L,LI J,DONG X,et al.Sharegpt4v:Improving largemulti-modal models with better captions[C]//European Confe- rence on Computer Vision.Cham:Springer,2025:370-387. [50]CHEN G H,CHEN S,ZHANG R,et al.Allava:Harnessing gpt4v-synthesized data for a lite vision-language model[J].ar- Xiv:2402.11684,2024. [51]BROWN T,MANN B,RYDER N,et al.Language models arefew-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems.2020:1877-1901. [52]DAI W L,LI J N,L D X,et al.InstructBLIP:towards general-purpose vision-language models with instruction tuning[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:49250-49267. [53]CHEN F,HAN M,ZHAO H,et al.X-llm:Bootstrapping ad-vanced large language models by treating multi-modalities as foreign languages[J].arXiv:2305.04160,2023. [54]ZHANG R,HAN J,LIU C,et al.Llama-adapter:Efficient fine-tuning of language models with zero-init attention[J].arXiv:2303.16199,2023. [55]WANG W,CHEN Z,CHEN X,et al.Visionllm:Large language model is also an open-ended decoder for vision-centric tasks[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:61501-61513. [56]ZHAO Z,GUO L,YUE T,et al.Chatbridge:Bridging modalities with large language model as a language catalyst[J].arXiv:2305.16103,2023. [57]LI L,YIN Y,LI S,et al.M3IT:A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[J].arXiv:2306.04387,2023. [58]GAO P,HAN J,ZHANG R,et al.Llama-adapter v2:Parameter-efficient visual instruction model[J]. arXiv:2304.15010,2023. [59]WANG Y,KORDI Y,MISHRA S,et al.Self-instruct:Aligning language models with self-generated instructions[J]. arXiv:2212.10560,2022. [60]LUO G,ZHOU Y,REN T,et al.Cheap and quick:Efficient vision-language instruction tuning for large language models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:29615-29627. [61]XU Z,SHEN Y,HUANG L.Multiinstruct:Improving multi-modal zero-shot learning via instruction tuning[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics .2023:11445-11465. [62]ZENG Y,ZHANG H,ZHENG J,et al.What Matters in Trai-ning a GPT4-Style Language Model with Multimodal Inputs?[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2024:7930-7957. [63]OUYANG L,WU J,JIANG X,et al.Training language models to follow instructions with human feedback[[C]//Proceedings of the 36th International Conference on Neural Information Processing Systems.2023:27730-27744. [64]RAFAILOV R,SHARMA A,MITCHELL E,et al.Direct pre-ference optimization:Your language model is secretly a reward mode[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:53728-53741. [65]LI L,XIE Z,LI M,et al.Silkie:Preference distillation for large visual language models[J].arXiv:2312.10665,2023. [66] YU T,YAO Y,ZHANG H,et al.RLHF-V:towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2024:13807-13816. [67]ZHU D,CHEN J,SHEN X,et al.Minigpt-4:Enhancing vision-language understanding with advanced large language models[J].arXiv:2304.10592,2023. [68] LI K C,HE Y,WANG Y,et al.Videochat:Chat-centric video understanding[J].arXiv:2305.06355,2023. [69]CHU Y,XU J,ZHOU X,et al.Qwen-audio:Advancing universal audio understanding via unified large-scale audio-language models[J].arXiv:2311.07919,2023. [70]WU C,YIN S,QI W,et al.Visual chatgpt:Talking,drawing and editing with visual foundation models[J].arXiv:2303.04671,2023. [71]SHEN Y,SONG K,TAN X,et al.Hugginggpt:Solving ai tasks with chatgpt and its friends in hugging face[J].Advances in Neural Information Processing Systems,2024,36. [72]HUANG R,LI M,YANG D,et al.Audiogpt:Understanding and generating speech,music,sound,and talking head[C]// Procee- dings of the AAAI Conference on Artificial Intelligence.2024:23802-23804. [73]WU S,FEI H,QU L,et al.Next-gpt:any-to-any multimodal LLM[J].arXiv:2309.05519,2023. [74]TANG Z,YANG Z,KHADEMI M,et al.CoDi-2:In-Context Interleaved and Interactive Any-to-Any Generation[C]// Procee- dings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27425-27434. [75] LIU Y,DENG G,LI Y,et al.Prompt Injection attack against LLM-integrated Applications[J].arXiv:2306.05499,2023. [76]DAI S,XU C,XU S,et al.Bias and unfairness in information retrieval systems:New challenges in the llm era[C]//Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.2024:6437-6447. [77]YAO Y,DUAN J,XU K,et al.A survey on large language mo- del (llm) security and privacy:The good,the bad,and the ugly[J].High-Confidence Computing,2024,4(2):100211. [78]DONG Z,ZHOU Z,YANG C,et al.Attacks,defenses and eva-luations for llm conversation safety:A survey[J].arXiv:2402.09283,2024. [79]PI R,HAN T,ZHANG J,et al.MLLM-Protector:EnsuringMLLM’s Safety without Hurting Performance[J].arXiv:2401.02906,2024. [80]GONG Y,RAN D,LIU J,et al.FigStep:jailbreaking large vision-language models via typographic visual prompts[J].arXiv:2311.05608,2023. [81]MENDES E,CHEN Y,HAYS J,et al.Granular privacy control for geolocation with vision languagemodels[J].arXiv:2407.04952,2024. [82]YANG Z,WEI Y,LIANG C,et al.Quantifying and enhancingmulti-modal robustness with modality preference[C]//International Conference on Learning Representations.2024:1-23. [83]BIRHANE A,PRABHU V,HAN S,et al.Into the laions den:investigating hate in multimodal datasets[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:21268-21284. [84] YANG Z,HE X,LI Z,et al.Data poisoning attacks against multimodal encoders[C]//International Conference on Machine Learning.PMLR,2023:39299-39313. [85]ZHANG Y,HUANG Y,SUN Y,et al.Benchmarking trustworthiness of multimodal large language models:a comprehensive study[J].arXiv:2406.07057,2024. [86]ZHANG H,SHAO W,LIU H,et al.AVIBench:towards evaluating the robustness of large vision-language model on adversa- rial visual-instructions[J].arXiv:2403.09346,2024. [87]WANG S,YE X,CHENG Q,et al.Cross-modality safety alignment[J].arXiv:2406.15279,2024. [88]WANG P,ZHANG D,LI L,et al.Inferaligner:inference-timealignment for harmlessness through cross-model gidance[J]. arXiv:2401.11206,2024. [89]QI X,ZENG Y,XIE T,et al.Fine-tuning Aligned LanguageModels Compromises Safety,Even When Users do not intend to![J].arXiv:2310.03693,2023. [90]LENG S,XING Y,CHENG Z,et al.The curse of multi-modalities:Evaluating hallucinations of large multimodal models across language,visual,and audio[J].arXiv:2410.12787,2024. [91]XU Y,YAO J,SHU M,et al.Shadowcast:stealthy data poisoning attacks against vision-language models[J].arXiv:2402.06659,2024. [92]TAO X,ZHONG S,LI L,et al.Imgtrojan:jailbreaking vision-language models with one image[J].arXiv:2403.02910,2024. [93]LI Y,DU Y,ZHOU K,et al.Evaluating object hallucination in large vision-language models[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:292-305. [94] ZOU X,YANG J,ZHANG H,et al.Segment everything everywhere all at once[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:9769-9782. [95]LOVENIA H,DAI W,CAHYAWIJAYA S,et al.Negative object presence evaluation (nope) to measure object hallucination in vision-language models[C]//Proceedings of the 3rd Workshop on Advances in Language and Vision Research.2024:37-58. [96]HU H,ZHANG J,ZHAO M,et al.CIEM:contrastive instruction evaluation method for better instruction tuning[J].arXiv:2309.02301,2023. [97]JIANG C,YE W,DONG M,et al.Hal-Eval:a universal and fine-grained hallucination evaluation framework for large vision language models[C]//Proceedings of the 32nd ACM International Conference on Multimedia.2024:525-534. [98]HUANG W,LIU H,GUO M,et al.Visual hallucinations ofmulti-modal large language models[C]//Proceedings of the Association for Computational Linguistics.2024:9614-9631. [99]KAUL P,LI Z,YANG H,et al.THRONE:An object-based hallucination benchmark for the free-form generations of large vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27228-27238. [100]CHANDU K R,LI L,AWADALLA A,et al.Certainly Uncertain:A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness[J].arXiv:2407.01942,2024. [101]CHEN X,WANG C,XUE Y,et al.Unified hallucination detection for multimodal large language models[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.2024:3235-3252. [102] GUNJAL A,YIN J,BAS E.Detecting and preventing hallucinations in large vision language models [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:18135-18143. [103]CHEN Z,ZHU Y,ZHAN Y,et al.Mitigating hallucination invisual language models with visual supervision[J].arXiv:2311.16479,2023. [104]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:common objects in context[C]//European Conference on Computer Vision.2014:740-755. [105]FU C,CHEN P,SHEN Y,et al.MME:a comprehensive evaluation benchmark for multimodal large language models[J].ar- Xiv:2306.13394,2023. [107]VILLA A,ALCÁZAR J C L,SOTO A,et al.Behind the magic,merlim:multi-modal evaluation benchmark for large image-language models[J].arXiv:2312.02219,2023. [108]ROHRBACH A,HENDRICKS L A,BURNS K,et al.Object hallucination in image captioning[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proces- sing.2018:4035-4045. [109]BEN-KISH A,YANUKA M,ALERP M,et al.MOCHa:multi-objective reinforcement mitigating caption hallucinations[J].arXiv:2312.03631,2023. [110]LIU F,LIN K,LI L,et al.Mitigating hallucination in largemulti-modal models via robust instruction tuning[C]//The Twelfth International Conference on Learning Representations.2023:1-45. [111]JING L,LI R,CHEN Y,et al.FaithScore:fine-grained evaluations of hallucinations in large vision-language models[J]. ar- Xiv:2311.01477,2023. [112]WANG J,ZHOU Y,XU G,et al.Evaluation and analysis of hallucination in large vision-language models[J].arXiv:2308.15126,2023. [113] SUN Z,SHEN S,CAO S,et al.Aligning large multimodal mo-dels with factually augmented RLHF[C]//Proceedings of the Association for Computational Linguistics.2024:13088-13110. [114]GUAN T,LIU F,WU X,et al.Hallusion Bench:anadvanceddiagnostic suite for entangled language hallucination and visual illusion in large vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:14375-14385. [115]WANG J,WANG Y,XU G,et al.AMBER:an llm-free multi-dimensional benchmark for mllms hallucination evaluation[J].arXiv:2311.07397,2024. [116]CALDARELLA S,MANCINI M,RICCI E,et al.The phantom menace:unmasking privacy leakages in vision-language models[J].arXiv:2408.01228,2024. [117]CHEN Y,MENDES E,DAS S,et al.Can language models be instructed to protect personal information?[J].arXiv:2310.02224,2023. [118]GU T,ZHOU Z,HUANG K,et al.MLLMGuard:a multi-dimensional safety evaluation suite for mult-imodal large language models[J].arXiv:2406.07594,2024. [119]WANG S,YE X,CHENG Q,et al.Cross-modality safety alignment[J].arXiv:2406.151279,2024. [120]XIA P,CHEN Z,TIAN J,et al.CARES:a comprehensivebenchmark of trustworthiness in medical vision language models[J].arXiv:2406.06007,2024. [121]CAPITANI G,LUCARINI A,BONICELLI L,et al.Beyond the surface:comprehensive analysis of implicit bias in vision-language models[C]//Intervento Presentato Al Convegno Fairness and Ethics towards Transparent AI:Face the Challenge through Model Debiasing.2024. [122]SESHADRI P,SINGH S,ELAZAR Y.The bias amplification paradox in text-to-image generation[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2024:6367-6384. [123]WAN Y,CHANG K W.The male ceo and the female assistant:gender biases in text-to-image generation of dual subjects[J].arXiv:2402.11089,2024. [124]ZHONG Y,BAGHEL B K.Multimodal understanding of memes with fair explanations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:2007-2017. [125]CAI R,SONG Z,GUAN D,et al.BenchLMM:ben-chmarking cross-style visual capability of large multimodal models[J]. ar- Xiv:2312.02896,2023. [126]CUI P,WANG J.Out-of-distribution (OOD) detection based on deep learning:a review[J].Electronics,2022,11(21):3500. [127]ZHAO Y,PANG T,DU C,et al.On Evaluating Adversarial robustness of large vision-language models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2024:54111-54138 [128]KHATTAK M U,NAEEM M F,HASSAN J, et al.How good is my video lmm? complex video reasoning and robustness evaluation suite for video-lmms[J].arXiv:2405.03690,2024. [129]ZHOU K,LIU C,ZHAO X,et al.Multimodal Situational Safety[J].arXiv:2410.06172,2024. [130]WANG Y,TENG Y,HUANG K,et al.Fake Alignment:Are LLMs Really Aligned Well?[J].arXiv:2311.05915,2023. [131] YOU H,ZHANG H,GAN Z,et al.Ferret:refer and groundanything anywhere at any granularity[J].arXiv:2310.07704,2023. [132] WANG L,HE J,LI S,et al.Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites[C]//International Conference on Multimedia Modeling.Cham:Springer,2024:32-45. [133] CHEN Z,WU J,WANG W,et al.Internvl:scaling up vision foundation models and aligning for generic visual-linguistic tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:24185-24198. [134] ZHAI B,YANG S,ZHAO X,et al.HallE-Switch:rethinkingand controlling object existence hallucinations in large vision language models for detailed caption[J].arXiv:2310.01779,2023. [135] LI Z,YANG B,LIU Q,et al.Monkey:image resolution and text label are important things for large multi-modal models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:26763-26773. [136] HE X,WEI L,XIE L,et al.Incorporating visual experts to resolve the information loss in multimodal large language models[J].arXiv:2401.03105,2024. [137] JAIN J,YANG J,SHI H.Vcoder:versatile vision encoders for multimodal large language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27992-28002. [138] ZHAO Y,LI Z,JIN Z,et al.Enhancing the spatial awareness capability of multi-modal large language model[J].arXiv:2310.20357,2023. [139] ZHAO Z,WANG B,OUYANG L,et al.Beyond hallucinations:enhancing lvlms through hallucination-aware direct preference optimization[J].arXiv:2311.16839,2023. [140] SUN Z,SHEN S,CAO S,et al.Aligning large multimodal mo-dels with factually augmented rlhf[C]//Proceedings of the Association for Computational Linguistics.2024:13088-13110. [141] JIANG C,XU H,DONG M,et al.Hallucination augmented contrastive learning for multimodal large language model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:27036-27046. [142] CHEN Z,WU J,WANG W,et al.Internvl:scaling up vision foundation models and aligning for generic visual-linguistic tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:24185-24198. [143] GUNJAL A,YIN J,BAS E.Detecting and preventing hallucinations in large vision language models [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:18135-18143. [144] LI L,XIE Z,LI M,et al.Silkie:preference distillation for large visual language models[J].arXiv:2312.10665,2023. [145] ZHOU Y,CUI C,RAFAILOV R,et al.Aligning modalities in vision large language models via preference fine-tuning [J]. ar- Xiv:2402.11411,2024. [146] DWORK C,ROTH A.The algorithmic foundations of differential privacy[J].Foundations and Trends in Theoretical Compu- ter Science,2014,9(3/4):211-407.[147] LIU Q,HUANG Y,JIN C,et al.Privacy and integrity protection for iot multimodal data using machine learning and blockchain[J].ACM Transactions on Multimedia Computing,Communications,an-d Applications,2024,20(6):1-18. [148] WANG L,SANG L,ZHANG Q,et al.A privacy-preserving framework with multi-modal data for cross-domain recommendation[J].Knowledge-Based Systems,2024,304:112529. [149] SAMSON L,BARAZANI N,GHEBREAB S,et al.Privacy-aware visual language models[J].arXiv:2405.17423,2024. [150] YIN L,LIN S,SUN Z,et al.PriMonitor:an adaptive tuning privacy-preserving approach for multimodal emotion detection[J].World Wide Web,2024,27(2):9. [151] CAO D,WU J,BASHIR A K.Multimodal large language mo-dels driven privacy-preserving wireless semantic communication in 6g[C]//IEEE International Conference on Communications Workshops.2024:171-176. [152] ALABDULMOHSIN I,WANG X,STEINER A,et al.CLIP the bias:how useful is balancing data in multimodal learning? [C]//International Conference on Learning Representations.2024:1-32. [153] CHENG H,GUO Y,GUO Q,et al.Social debiasing for fair multi-modal llms[J].arXiv:2408.06569,2024. [154] BERG H,HALL S M,BHALGAT Y,et al.A prompt array keeps the bias away:debiasing vision-language models with adversarial learning[C]//Proceedings of the 2nd Conference of the Asia-Pacif-ic Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing.2022:806-822. [155] JIANG J,MANORANJAN V,SALAM H,et al.Towards generalised and incremental bias mitigation in personality computing[J].IEEE Transactions on Affective Computing,2024,15(4):2192-2203. [156]WANG Z,LI X,QIN Z,et al.Can We Debias Multimodal Large Language Models via Model Editing?[C]//Proceedings of the 32nd ACM International Conference on Multimedia.2024:3219-3228. [157]BRINKMANN J,SWOBODA P,BARTELT C.A multidimen-sional analysis of social biases in vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:4914-4923. [158]CARLINI N,NASR M,CHOQUETTE-CHOO C A,et al.Are aligned neural networks adversarially al-igned? [C]// Procee- dings of the 37th International Conference on Neural Information Processing Systems.2023:61478-61500. [159]SHAYEGANI E,DONG Y,ABU-GHAZALEH N.Jailbreak in pieces:compositional adversarial attacks on multi-modal language models[J].arXiv:2307.14539,2023. [160]WANG R,MA X,ZHOU H,et al.White-box multimodal jailbreaks against large vision-language models[J].arXiv:2405.17894,2024. [161]YIN Z,YE M,ZHANG T,et al.VLATTACK:multimodal adversarial attacks on vision-language tasks via pre-trained models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:52936-52956. [162]QI X,HUANG K,PANDA A,et al.Visual adversarial examples jailbreak aligned large language models[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:21527-21536. [163]NIU Z,REN H,GAO X,et al.Jailbreaking attack against multimodal large language model[J].arXiv:2402.02309,2024. [164]LUO H,GU J,LIU F,et al.An image is worth 1000 lies:adversarial transferability across prompts on vision-language models[J].arXiv:2403.09766,2024. [165]GU X,ZHENG X,PANG T,et al.Agent smith:a single image can jailbreak one million multimodal llm agents exponentially fast[J].arXiv:2402.08567,2024. [166]MA S,LUO W,WANG Y,et al.Visual-RolePlay:universal jailbreak attack on multimodal large language models via role-playing image character[J].arXiv:2405.20773,2024. [167]LI Y,GUO H,ZHOU K,et al.Images are achilles’ heel ofalignment:exploiting visual vulnerabilities for jailbreaking multimodal large language models[J].arXiv:2403.09792,2024. [168]LIU X,ZHU Y,GU J,et al.MM-SafetyBench:a benchmark for safety evaluation of multimodal large language models[J]. ar- Xiv:2311.17600,2023. [169]Madry A.Towards deep learning models resistant to adversarial attacks[J].arXiv:1706.06083,2017. [170]CROCE F,HEIN M.Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks[C]//International Conference on Machine Learning.PMLR,2020:2206-2216. [171]CARLINI N,WAGNER D.Towards evaluating the robustness of neural networks[C]//2017 IEEE Symposium on Security and Privacy.2017:39-57. [172] CUI X,APARCEDO A,JANG Y K,et al.On the robustness of large multimodal models against image adversarial attacks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:24625-24634. [173]LUO H,GU J,LIU F,et al.An image is worth 1000 lies:adversarial transferability across prompts on vision-language models[J].arXiv:2403.09766,2024. [174]GAO K,BAI Y,GU J,et al.Inducing high energy-latency of large vision-language models with verbose images[J].arXiv:2401.11170,2024. [175]FU X,WANG Z,LI S,et al.Misusing tools in large language models with visual adversarial examples[J].arXiv:2310.03185,2023. [176]WU X,CHAKRABORTY S,XIAN R,et al.Highlighting the safety concerns of deploying llms/vlms in robotics[J].arXiv:2402.10340,2024. [177]DONG Y,CHEN H,CHEN J,et al.How robust is google’s bard to adversarial image attacks?[J].arXiv:2309.11751,2023. [178]WANG X,JI Z,MA P,et al.InstructTA:instruction-tuned targeted attack for large vision-language models[J].arXiv:2312.01886,2023. [179]CHENG S,MIAO Y,DONG Y,et al.Efficient black-box adversarial attacks via bayesian optimization guided by a function prior [C]//Proceedings of the 41st International Conference on Machine Learning.PMLR,2024:8163-8183. [180]FRAZIER P I.A tutorial on Bayesian optimization[J].arXiv:1807.02811,2018. [181]CARLINI N,TERZIS A.Poisoning and backdooring contrastive learning[J].arXiv:2402.13851,2024. [182]LIANG J,LIANG S,LUO M,et al.VL-Trojan:multimodal instruction backdoor attacks against autoregressive visual language models[J].arXiv:2402.13851,2024. [183]NI Z,YE R,WEI Y,et al.Physical backdoor attack can jeopar-dize driving with vision-large-language models[J].arXiv:2404.12916,2024. [184]LU D,PANG T,DU C,et al.Test-time backdoor attacks onmultimodal large language models[J].arXiv:2402.08577,2024. [185]LIANG S,LIANG J,PANG T,et al.Revisiting backdoor at-tacks against large vision-language models[J].arXiv:2406.18844,2024. [186]CHEN C,HUANG B,LI Z,et al.Can editing llms inject harm?[J].arXiv:2407.20224,2024. [187]CHENG S Y,TIAN B Z,LIU Q B,et al.Can we edit multimodal large language models?[C]//Proceedings of the 2023 Confe- rence on Empirical Methods in Natural Language Processing.2023:13877-13888. [188]WANG Y,LIU X,LI Y,et al.AdaShield:safeguarding multimodal large language models from structure-based attack via adaptive shield prompting[C]//European Conference on Computer Vision (ECCV).2024:1-25. [189]ZHANG X,ZHANG C,LI T,et al.Jailguard:A universal detection framework for llm prompt-based attacks[J].arXiv:2312.10766,2024. [190]PI R,HAN T,ZHANG J,et al.MLLM-Protecor:ensuringmllm’s safety without hurting performance[J].arXiv:2401.02906,2024. [191]CHEN Y,SIKKA K,COGSWELL M,et al.Dress:instructing large vision-language models to align and interact with humans via natural language feedback[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:14239-14250. [192]DENG J,DONG W,SOCHER R,et al.ImageNet:A Large-Scale Hierarchical Image Database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition.2009:248-255. [193]PLUMMER B A,WANG L,CERVANTES C M,et al.Flickr30k entities:Collecting region-to-phrase correspondences for richer image-to-sentence models[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2641-2649. [194]GURARI D,LI Q,STANGL A J,et al.Vizwiz grand challenge:Answering visual questions from blind people[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:3608-3617. [195]GOYAL Y,KHOT T,SUMMERS-STAY D,et al.Making the v in vqa matter:Elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:6904-6913. [196]MARINO K,RASTEGARI M,FARHADI A,et al.Ok-vqa:A visual question answering benchmark requiring external know- ledge[C]//Proceedings of the IEEE/cvf Conference on Compu- ter Vision and Pattern Recognition.2019:3195-3204. [197]LIN H,LUO Z,WANG B,et al.Goat-bench:Safety insights to large multimodal models through meme-based social abuse[J].arXiv:2401.01523,2024. [198]WANG X,YI X,JIANG H,et al.ToViLaG:your visual-lan-guage generative model is also an evildoer [C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:3508-3533. [199]YING Z,LIU A,LIANG S,et al.SafeBench:a safety evaluation framework for multimodal large language models[J].arXiv:2410.18927,2024. [200]LIU X,ZHU Y,GU J,et al.MM-SafetyBench:a benchmark for safety evaluation of multimodal large language models[J]. ar- Xiv:2311.17600,2023. [201]LI M,LI L,YIN Y,et al.Red teaming visual language models[J].arXiv:2401.12915,2024. [202]WU Y,LI X,LIU Y,et al.Jailbreaking gpt-4v via self-adversarial attacks with system prompts[J].arXiv:2311.09127,2023. [203]BAILEY L,ONG E,RUSSELL S,et al.Image hijacks:Adversarial images can control generative models at runtime[J]. ar- Xiv:2309.00236,2023. [204]VAN M H,WU X.Detecting and correcting hate speech in multimodal memes with large visual language model[J].arXiv:2311.06737,2023. [205]SCHLARMANN C,HEIN M.On the adversarial robustness of multi-modal foundation models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:3677-3685. [206]JI Y,GE C,KONG W,et al.Large language models as automated aligners for benchmarking vision-language models[J]. ar- Xiv:2311.14580,2023. [207]GUO Q,PANG S,JIA X,et al.Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models[J].arXiv:2404.10335,2024. [208]FAN Y,CAO Y,ZHAO Z,et al.Unbridled icarus:a survey of the potential perils of image inputs in multimodal large language model security[J].arXiv:2407.12784,2024. [209]ZHAO S,YANG Y,WANG Z,et al.Retrieval augmented gene-ration (rag) and beyond:a comprehensive survey on how to make your llms use external data more wisely[J].arXiv:2409.14924,2024. [210]ZHANG B,TAN Y,SHEN Y,et al.Breaking agents:compromising autonomous llm agents through malfunction amplification[J].arXiv:2407.20859,2024. [211]WANG Y,XUE D,ZHANG S,et al.BadAgent:inserting andactivating backdoor attacks in llm agents[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.2024:9811-9827 [212]CHEN Z,XIANG Z,XIAO C,et al.Agentpoison:red-teaming llm agents via poisoning memory or knowledge bases[J].arXiv:2407.12784,2024. [213]HINTERSDORF D,STRUPPEK L,BRACK M,et al.Does clip know my face?[J].Journal of Artificial Intelligence Research,2024,80:1033-1062. |
[1] | ZHANG Guanghua, CHEN Fang, CHANG Jiyou, HU Boning, WANG He. Accelerating Firmware Vulnerability Discovery Through Precise Localization of IntermediateTaint Sources and Dangerous Functions [J]. Computer Science, 2025, 52(7): 379-387. |
[2] | SUN Qiming, HOU Gang, JIN Wenjie, HUANG Chen, KONG Weiqiang. Survey on Fuzzing of Embedded Software [J]. Computer Science, 2025, 52(7): 13-25. |
[3] | YU Yiming, CHEN Yuanzhi, LANG Jun. Analysis of DNS Threats and the Challenges of DNS Security [J]. Computer Science, 2025, 52(6A): 240900140-8. |
[4] | ZHU Keda, CAI Ruijie, LIU Shengli. Large Scale Network Defense Algorithm Based on Temporal Network Flow Watermarking Technology [J]. Computer Science, 2025, 52(6A): 240900110-6. |
[5] | HUANG Xiaoyu, JIANG Hemeng, LING Jiaming. Privacy Preservation of Crowdsourcing Content Based on Adversarial Generative Networks [J]. Computer Science, 2025, 52(6A): 250200123-7. |
[6] | XIA Zhuoqun, ZHOU Zihao, DENG Bin, KANG Chen. Security Situation Assessment Method for Intelligent Water Resources Network Based on ImprovedD-S Evidence [J]. Computer Science, 2025, 52(6A): 240600051-6. |
[7] | LI Weifeng, XIE Jiangping. Study on System Security Testing Method Based on Digital Twin [J]. Computer Science, 2025, 52(6A): 240700068-7. |
[8] | WANG Yun, ZHAO Jianming, GUO Yifeng, ZHOU Huanhuan, ZHOU Wuai, ZHANG Wanzhe, FENG Jianhua. Automation and Security Strategies and Empirical Research on Operation and Maintenance of Digital Government Database [J]. Computer Science, 2025, 52(6A): 240500045-8. |
[9] | GAO Xinjun, ZHANG Meixin, ZHU Li. Study on Short-time Passenger Flow Data Generation and Prediction Method for RailTransportation [J]. Computer Science, 2025, 52(6A): 240600017-5. |
[10] | ZHANG Yaolin, LIU Xiaonan, DU Shuaiqi, LIAN Demeng. Hybrid Quantum-classical Compressed Generative Adversarial Networks Based on Matrix Product Operators [J]. Computer Science, 2025, 52(6): 74-81. |
[11] | LIU Huayong, ZHU Ting. Semi-supervised Cross-modal Hashing Method for Semantic Alignment Networks Basedon GAN [J]. Computer Science, 2025, 52(6): 159-166. |
[12] | WEI Youyuan, SONG Jianhua, ZHANG Yan. Survey of Binary Code Similarity Detection Method [J]. Computer Science, 2025, 52(6): 365-380. |
[13] | KANG Kai, WANG Jiabao, XU Kun. Balancing Transferability and Imperceptibility for Adversarial Attacks [J]. Computer Science, 2025, 52(6): 381-389. |
[14] | ZHENG Xu, HUANG Xiangjie, YANG Yang. Reversible Facial Privacy Protection Method Based on “Invisible Masks” [J]. Computer Science, 2025, 52(5): 384-391. |
[15] | WANG Yifei, ZHANG Shengjie, XUE Dizhan, QIAN Shengsheng. Self-supervised Backdoor Attack Defence Method Based on Poisoned Classifier [J]. Computer Science, 2025, 52(4): 336-342. |
|