Computer Science ›› 2025, Vol. 52 ›› Issue (1): 34-41.doi: 10.11896/jsjkx.240400190

• Technology Research and Application of Large Language Model •

Survey on Large Model Red Teaming

BAO Zepeng, QIAN Tieyun   

  1. School of Computer Science,Wuhan University,Wuhan 430072,China
  • Received:2024-04-28 Revised:2024-08-12 Online:2025-01-15 Published:2025-01-09
  • About author:BAO Zepeng,born in 2002,undergraduate,is a member of CCF(No.U8466G).His main research interests include LLM safety and recommendation systems.
    QIAN Tieyun,born in 1970,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.13483M).Her main research interests include web mining and natural language processing.
  • Supported by:
    National Natural Science Foundation of China(62276193).

Abstract: Large model red teaming is an emerging frontier in the field of large language models (LLMs). It subjects an LLM to adversarial testing designed to induce harmful outputs, so as to uncover vulnerabilities in the model and improve its robustness. In recent years, large model red teaming has attracted widespread attention from both academia and industry; numerous solutions have been proposed, and some progress has been made in model alignment. However, owing to the scarcity of red teaming data and the lack of clear evaluation standards, most existing research remains limited to specific scenarios. In this paper, starting from the definition of large model security, we discuss the various risks associated with it. We then discuss the importance of large model red teaming and its main categories, providing a comprehensive overview and analysis of the development of related red team techniques. Additionally, we introduce existing datasets and evaluation metrics. Finally, future research trends in large model red teaming are summarized and discussed.

Key words: Red team, LLM safety, Reinforcement learning, Language model, Jailbreak

CLC Number: TP391
[1]WEI J,TAY Y,BOMMASANI R,et al.Emergent abilities of large language models[J].arXiv:2206.07682,2022.
[2]ZHANG W M,WANG Z Y,LI Y G,et al.Introduction to computing[M].Beijing:Beijing Institute of Technology Press.2016.
[3]DING C Y.Legal Regulation of the Network Society[M].Beijing:China University of Political Science & Law Press.2016.
[4]WEIDINGER L,UESATO J,RAUH M,et al.Taxonomy of risks posed by language models[C]//Proceedings of the 2022 ACM Conference on Fairness,Accountability,and Transparency.2022:214-229.
[5]JONES E,DRAGAN A,RAGHUNATHAN A,et al.Automatically Auditing Large Language Models via Discrete Optimization[J].arXiv:2303.04381,2023.
[6]CHAN A,SALGANIK R,MARKELIUS A,et al.Harms from Increasingly Agentic Algorithmic Systems[C]//Proceedings of the 2023 ACM Conference on Fairness,Accountability,and Transparency.2023:651-666.
[7]HENDRYCKS D,MAZEIKA M,WOODSIDE T.An Overview of Catastrophic AI Risks[J].arXiv:2306.12001,2023.
[8]KOUR G,ZALMANOVICI M,ZWERDLING N,et al.Unveiling Safety Vulnerabilities of Large Language Models[J].arXiv:2311.04124,2023.
[9]INIE N,STRAY J,DERCZYNSKI L.Summon a Demon and Bind it:A Grounded Theory of LLM Red Teaming in the Wild[J].arXiv:2311.06237,2023.
[10]CASPER S,LIN J,KWON J,et al.Explore,Establish,Exploit:Red Teaming Language Models from Scratch[J].arXiv:2306.09442,2023.
[11]White Paper on Artificial Intelligence Safety Standardisation[EB/OL].[2023-11-20].https://www.tc260.org.cn/upload/2023-05-31/1685501487351066337.pdf.
[12]Basic Requirements for the Safety of Generative AI Services[EB/OL].[2023-11-20].https://www.tc260.org.cn/upload/2023-08-25/1692961404507050376.pdf.
[13]Code of Ethics for the Next Generation of Artificial Intelligence[EB/OL].[2023-11-20].https://www.most.gov.cn/kjbgz/202109/t20210926_177063.html.
[14]VON STENGEL B,KOLLER D.Team-maxmin equilibria[J].Games and Economic Behavior,1997,21(1/2):309-321.
[15]COHEN F.Managing network security-red teaming[J].Network Security,1998(3):13-15.
[16]JI J,QIU T,CHEN B,et al.Ai alignment:A comprehensive survey[J].arXiv:2310.19852,2023.
[17]PEREZ E,HUANG S,SONG F,et al.Red teaming language models with language models[J].arXiv:2202.03286,2022.
[18]CHAKRABORTY A,ALAM M,DEY V,et al.A survey on adversarial attacks and defences[J].CAAI Transactions on Intelligence Technology,2021,6(1):25-45.
[19]LIU Y,DENG G,XU Z,et al.Jailbreaking chatgpt via prompt engineering:An empirical study[J].arXiv:2305.13860,2023.
[20]CHEN Z,LI B,WU S,et al.Content-based Unrestricted Adversarial Attack[J].arXiv:2305.10665,2023.
[21]WEI A,HAGHTALAB N,STEINHARDT J.Jailbroken:How does llm safety training fail?[J].arXiv:2307.02483,2023.
[22]OpenAI.GPT-4 technical report[J].arXiv:2303.08774,2023.
[23]PEARCE W,LUCAS J.Nvidia ai red team:An introduction[EB/OL].[2023-11-20].https://developer.nvidia.com/blog/nvidia-ai-red-team-an-introduction/.
[24]KUMAR R S S.Microsoft ai red team building future of safer ai[EB/OL].[2023-11-20].https://www.microsoft.com/en-us/security/blog/2023/08/07/microsoft-ai-red-team-building-future-of-safer-ai/.
[25]FABIAN D.Google's ai red team:the ethical hackers making ai safer[EB/OL].[2023-11-20].https://blog.google/technology/safety-security/googles-ai-red-team-the-ethical-hackers-making-ai-safer/.
[26]Plan for red-team testing of the large language model(LLM) and its applications[EB/OL].[2024-05-23].https://learn.microsoft.com/zh-cn/azure/ai-services/openai/concepts/red-teaming.
[27]LIU Y,ZHANG K,LI Y,et al.Sora:A Review on Background,Technology,Limitations,and Opportunities of Large Vision Models[J].arXiv:2402.17177,2024.
[28]The first cybersecurity benchmarking platform in China,SecBench,has been released [J].China Information Security,2024(2):83.
[29]RANDO J,PALEKA D,LINDNER D,et al.Red-teaming the stable diffusion safety filter[J].arXiv:2210.04610,2022.
[30]GOODFELLOW I J,SHLENS J,SZEGEDY C.Explaining and harnessing adversarial examples[J].arXiv:1412.6572,2014.
[31]XU J,JU D,LI M,et al.Bot-adversarial dialogue for safe conversational agents[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2021:2950-2968.
[32]KANG D,LI X,STOICA I,et al.Exploiting programmatic behavior of llms:Dual-use through standard security attacks[J].arXiv:2302.05733,2023.
[33]SU H,CHENG C C,FARN H,et al.Learning from Red Teaming:Gender Bias Provocation and Mitigation in Large Language Models[J].arXiv:2310.11079,2023.
[34]WEN J,KE P,SUN H,et al.Unveiling the Implicit Toxicity in Large Language Models[J].arXiv:2311.17391,2023.
[35]DING P,KUANG J,MA D,et al.A Wolf in Sheep’s Clothing:Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily[J].arXiv:2311.08268,2023.
[36]CHAO P,ROBEY A,DOBRIBAN E,et al.Jailbreaking blackbox large language models in twenty queries[J].arXiv:2310.08419,2023.
[37]GEHMAN S,GURURANGAN S,SAP M,et al.RealToxicityPrompts:Evaluating neural toxic degeneration in language models[J].arXiv:2009.11462,2020.
[38]LEE D,LEE J Y,HA J W,et al.Query-Efficient Black-Box Red Teaming via Bayesian Optimization[J].arXiv:2305.17444,2023.
[39]GE S,ZHOU C,HOU R,et al.MART:Improving LLM Safety with Multi-round Automatic Red-Teaming[J].arXiv:2311.07689,2023.
[40]BHARDWAJ R,PORIA S.Red-teaming large language models using chain of utterances for safety-alignment[J].arXiv:2308.09662,2023.
[41]DENG B,WANG W,FENG F,et al.Attack Prompt Generation for Red Teaming and Defending Large Language Models[J].arXiv:2310.12505,2023.
[42]GILARDI F,ALIZADEH M,KUBLI M.Chatgpt outperforms crowd-workers for text-annotation tasks[J].arXiv:2303.15056,2023.
[43]GANGULI D,LOVITT L,KERNION J,et al.Red teaming language models to reduce harms:Methods,scaling behaviors,and lessons learned[J].arXiv:2209.07858,2022.
[44]LIU X,XU N,CHEN M,et al.Autodan:Generating stealthy jailbreak prompts on aligned large language models[J].arXiv:2310.04451,2023.
[45]ZHANG Z,CHENG J,SUN H,et al.Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation[J].arXiv:2212.01810,2022.
[46]JI J,LIU M,DAI J,et al.Beavertails:Towards improved safety alignment of llm via a human-preference dataset[J].arXiv:2307.04657,2023.
[47]ASKELL A,BAI Y,CHEN A,et al.A general language assistant as a laboratory for alignment[J].arXiv:2112.00861,2021.
[48]ZOU A,WANG Z,KOLTER J Z,et al.Universal and transferable adversarial attacks on aligned language models[J].arXiv:2307.15043,2023.
[49]LIN S,HILTON J,EVANS O.Truthfulqa:Measuring how models mimic human falsehoods[J].arXiv:2109.07958,2021.
[50]ZHENG L,CHIANG W L,SHENG Y,et al.Judging LLM-as-a-judge with MT-Bench and Chatbot Arena[J].arXiv:2306.05685,2023.
[51]HUANG Y,ZHANG Q,SUN L.TrustGPT:A Benchmark for Trustworthy and Responsible Large Language Models[J].arXiv:2306.11507,2023.
[52]WANG B,XU C,WANG S,et al.Adversarial glue:A multi-task benchmark for robustness evaluation of language models[J].arXiv:2111.02840,2021.
[53]RUAN Y,DONG H,WANG A,et al.Identifying the risks of lm agents with an lm-emulated sandbox[J].arXiv:2309.15817,2023.