Computer Science, 2025, Vol. 52, Issue (1): 34-41. doi: 10.11896/jsjkx.240400190
包泽芃, 钱铁云
BAO Zepeng, QIAN Tieyun
Abstract: Large model red teaming aims to subject large language models (LLMs) to adversarial testing that induces them to produce harmful outputs, which serve as test cases for uncovering vulnerabilities in the model and improving its robustness. As a frontier topic in the field of large models, red teaming has attracted broad attention from academia and industry in recent years. Researchers have proposed numerous red-teaming approaches and have made some progress on model alignment. However, constrained by the scarcity of red-teaming data and the vagueness of evaluation criteria, most existing studies are limited to evaluation in specific scenarios. This paper first starts from definitions related to large model safety and describes the various risks involved; it then explains the importance and main categories of large model red teaming, surveys and analyzes the development of related red-teaming techniques, and introduces existing datasets and evaluation metrics; finally, it summarizes the field and offers an outlook on future directions for large model red teaming.
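To make the workflow described in the abstract concrete, the following is a minimal, hypothetical sketch of an automated red-teaming loop in Python: an attacker proposes adversarial prompts, the target model responds, and a harmfulness judge decides whether the exchange is kept as a test case. All names here (RedTeamCase, red_team, the toy attacker/target/judge) are illustrative assumptions and do not correspond to any specific method in the surveyed literature; in practice the attacker and judge would themselves be LLMs or trained classifiers.

```python
# Minimal sketch of an automated red-teaming loop (illustrative only).
# The three components -- attacker, target, and harm scorer -- are
# hypothetical stand-ins; a real system would plug in an LLM or classifier.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RedTeamCase:
    prompt: str        # adversarial test prompt produced by the attacker
    response: str      # target model's output
    harm_score: float  # 0.0 (safe) .. 1.0 (clearly harmful)

def red_team(attacker: Callable[[List[RedTeamCase]], str],
             target: Callable[[str], str],
             score_harm: Callable[[str, str], float],
             rounds: int = 10,
             threshold: float = 0.5) -> List[RedTeamCase]:
    """Iteratively probe the target model and keep prompts that elicit
    harmful responses; these become test cases for later alignment work."""
    found: List[RedTeamCase] = []
    history: List[RedTeamCase] = []
    for _ in range(rounds):
        prompt = attacker(history)            # propose a new adversarial prompt
        response = target(prompt)             # query the model under test
        score = score_harm(prompt, response)  # judge harmfulness of the reply
        case = RedTeamCase(prompt, response, score)
        history.append(case)
        if score >= threshold:
            found.append(case)                # a vulnerability surfaced
    return found

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    cases = red_team(
        attacker=lambda hist: f"adversarial prompt #{len(hist)}",
        target=lambda p: f"response to: {p}",
        score_harm=lambda p, r: 0.0,  # a real judge would be an LLM or classifier
    )
    print(f"{len(cases)} harmful cases found")
```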