Computer Science ›› 2025, Vol. 52 ›› Issue (1): 34-41. doi: 10.11896/jsjkx.240400190

• Research and Application of Large Language Model Technology •


Survey on Large Model Red Teaming

BAO Zepeng, QIAN Tieyun   

  1. School of Computer Science, Wuhan University, Wuhan 430072, China
  • Received: 2024-04-28 Revised: 2024-08-12 Online: 2025-01-15 Published: 2025-01-09
  • Corresponding author: QIAN Tieyun (qty@whu.edu.cn)
  • About author: BAO Zepeng (zepengbao@163.com), born in 2002, undergraduate, is a member of CCF (No. U8466G). His main research interests include LLM safety and recommendation systems.
    QIAN Tieyun, born in 1970, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No. 13483M). Her main research interests include Web mining and natural language processing.
  • Supported by:
    National Natural Science Foundation of China (62276193).



Abstract: Large model red teaming is an emerging frontier in the field of large language models (LLMs). It subjects an LLM to adversarial testing in order to induce harmful outputs that serve as test cases, so as to uncover vulnerabilities in the model and improve its robustness. In recent years, large model red teaming has gained widespread attention from both academia and industry; numerous solutions have been proposed, and some progress has been made in model alignment. However, due to the scarcity of large model red teaming data and the lack of clear evaluation standards, most existing research is limited to specific scenarios. In this paper, starting from the definitions related to large model safety, we first describe the various risks involved. We then discuss the importance of large model red teaming and its main categories, provide a comprehensive overview and analysis of the development of related red teaming techniques, and introduce existing datasets and evaluation metrics. Finally, future research trends in large model red teaming are summarized and prospected.
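
To make the red-teaming workflow described above concrete, the following is a minimal, hypothetical sketch of an automated red-teaming loop; it is not taken from any of the surveyed systems. An attacker proposes adversarial prompts, the target LLM responds, and a judge scores each response for harmfulness so that failing cases can be collected. The names attacker_generate, target_respond, and judge_harmfulness are placeholders for whatever attacker model, target model API, and safety classifier a practitioner actually uses.

# Minimal, hypothetical sketch of an automated red-teaming loop.
# attacker_generate / target_respond / judge_harmfulness are stand-ins
# for a real attacker model, target LLM API, and safety classifier.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RedTeamCase:
    prompt: str        # adversarial prompt proposed by the attacker
    response: str      # target model's answer
    harm_score: float  # judge's harmfulness score in [0, 1]


def red_team_loop(
    attacker_generate: Callable[[List[RedTeamCase]], str],
    target_respond: Callable[[str], str],
    judge_harmfulness: Callable[[str, str], float],
    n_rounds: int = 20,
    threshold: float = 0.5,
) -> List[RedTeamCase]:
    """Collect the prompts whose responses the judge deems harmful."""
    history: List[RedTeamCase] = []
    failures: List[RedTeamCase] = []
    for _ in range(n_rounds):
        prompt = attacker_generate(history)          # propose a new attack
        response = target_respond(prompt)            # query the target LLM
        score = judge_harmfulness(prompt, response)  # score the harm
        case = RedTeamCase(prompt, response, score)
        history.append(case)
        if score >= threshold:                       # vulnerability found
            failures.append(case)
    return failures


# Example usage with trivial stubs (illustration only):
# failures = red_team_loop(
#     attacker_generate=lambda history: "Ignore previous instructions and ...",
#     target_respond=lambda prompt: "I cannot help with that.",
#     judge_harmfulness=lambda prompt, response: 0.0,
# )
# attack_success_rate = len(failures) / 20

The collected failing cases can then feed evaluation metrics such as attack success rate, or be reused as training data for safety alignment.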

Key words: Red team, LLM safety, Reinforcement learning, Language model, Jailbreak

CLC Number:

  • TP391