Computer Science ›› 2025, Vol. 52 ›› Issue (10): 201-207. doi: 10.11896/jsjkx.240800148
熊卓帜, 顾洲洪, 冯红伟, 肖仰华
XIONG Zhuozhi, GU Zhouhong, FENG Hongwei, XIAO Yanghua
Abstract: Multiple Choice Question (MCQ) based methods for evaluating the subject knowledge of pretrained language models can assess a model's knowledge quickly and quantitatively, but their reliability is disturbed by irrelevant factors such as option order and option length, raising robustness concerns. To address this challenge, this paper first proposes an analysis framework for MCQ-based subject-knowledge evaluation of pretrained language models. The framework formalizes MCQ evaluation into two modules, prompting and parsing, and systematically explores how different categories of MCQ evaluation methods affect evaluation results. Experiments on Chinese and English subject-knowledge evaluation datasets analyze the robustness of different prompting and parsing methods. Based on these findings, a rewriting-augmented parsing method is proposed, which introduces a pretrained language model to rewrite model responses, effectively overcoming the limitations of traditional rule-based parsing when handling non-standard responses. By combining rewriting with rule-based parsing, the method not only improves the accuracy of answer extraction but also enhances the robustness of the evaluation process, providing a new and effective approach to language model evaluation.
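To make the rewriting-augmented parsing idea concrete, the following is a minimal sketch, not the paper's implementation: rule-based extraction is tried first, and only when it fails on a non-standard reply is the response rewritten by a language model into a canonical "Answer: X" form and parsed again. The function names (parse_answer, rule_parse), the fallback order, the regex, and the rewrite stub are all illustrative assumptions rather than details given in the abstract.

```python
import re
from typing import Callable, Optional


def rule_parse(reply: str) -> Optional[str]:
    """Rule-based parsing: extract the first standalone option letter A-D."""
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None


def parse_answer(reply: str, rewrite: Callable[[str], str]) -> Optional[str]:
    """Rewriting-augmented parsing (sketch): fall back to an LLM rewrite of the
    reply when rule-based extraction fails on a non-standard response."""
    answer = rule_parse(reply)
    if answer is None:
        # `rewrite` is assumed to prompt a pretrained language model to restate
        # the reply as "Answer: X"; the exact prompt is not specified here.
        answer = rule_parse(rewrite(reply))
    return answer


if __name__ == "__main__":
    # Toy stand-in for the LLM rewriting step, for illustration only.
    fake_rewrite = lambda reply: "Answer: B"
    print(parse_answer("I believe the second option is correct.", fake_rewrite))  # B
```

In this sketch the rewriting model is invoked only for replies the rules cannot handle, which keeps the extra inference cost limited to the non-standard cases the abstract identifies as the weakness of purely rule-based parsing.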