Computer Science ›› 2025, Vol. 52 ›› Issue (10): 201-207. doi: 10.11896/jsjkx.240800148

• Artificial Intelligence •

Subject Knowledge Evaluation Method for Language Models Based on Multiple Choice Questions

XIONG Zhuozhi, GU Zhouhong, FENG Hongwei, XIAO Yanghua

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  • Received: 2024-08-28  Revised: 2024-10-31  Online: 2025-10-15  Published: 2025-10-14
  • Corresponding author: XIAO Yanghua (shawyh@fudan.edu.cn)

  • About author: XIONG Zhuozhi (xiongzz21@m.fudan.edu.cn), born in 1999, postgraduate. His main research interests include knowledge graph and natural language processing.
    XIAO Yanghua, born in 1980, Ph.D, professor, Ph.D supervisor, is a member of CCF (No. 12210D). His main research interests include knowledge graph and natural language processing.

Abstract: Subject knowledge evaluation methods for pre-trained language models based on Multiple Choice Questions (MCQ) offer rapid, quantitative evaluation of model knowledge. However, their reliability is compromised by irrelevant factors such as option order and option length, raising robustness concerns. To address this challenge, an analytical framework for evaluating the subject knowledge of pre-trained language models using MCQ is proposed. This framework formalizes MCQ evaluation into two modules, prompting and parsing, and systematically investigates the impact of various MCQ evaluation methods on evaluation outcomes. The robustness of different prompting and parsing techniques is analyzed through experiments on Chinese and English subject knowledge evaluation datasets. Based on these findings, a rewriting-enhanced parsing method is introduced that employs pre-trained language models to rewrite model responses, effectively overcoming the limitations of traditional rule-based parsing when handling non-standard replies. By integrating rewriting and rule-based parsing, this approach improves both answer-extraction accuracy and the robustness of the evaluation process, offering a novel and effective strategy for language model evaluation.
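The abstract gives no implementation details, so the following is only a minimal Python sketch of how the described two-module pipeline (prompting and parsing) and the rewriting-enhanced parsing step could look. All names here (build_prompt, rule_parse, rewrite_then_parse, ask_model, the prompt wording, and the A-D option set) are illustrative assumptions, not the authors' code.

```python
import re

def build_prompt(question: str, options: dict[str, str]) -> str:
    """Prompting module (sketch): format an MCQ as a plain-text prompt."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in options.items()]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def rule_parse(response: str) -> str | None:
    """Rule-based parsing (sketch): extract a standalone option letter A-D."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

def rewrite_then_parse(response: str, ask_model) -> str | None:
    """Rewriting-enhanced parsing (sketch): when the rules fail on a
    non-standard reply, ask a language model to rewrite the reply into a
    bare option letter, then apply the same rule-based parser again."""
    answer = rule_parse(response)
    if answer is None:
        rewritten = ask_model(
            "Rewrite the following answer as a single option letter "
            "(A, B, C, or D) and nothing else:\n" + response
        )
        answer = rule_parse(rewritten)
    return answer

if __name__ == "__main__":
    # A non-standard reply that defeats the rule-based parser on its own:
    reply = "I believe the second option is correct."
    fake_model = lambda prompt: "B"  # stand-in for a real LM call
    print(rewrite_then_parse(reply, fake_model))  # -> "B"
```

In this sketch the rewriting model is invoked only when rule-based extraction fails, which keeps the cheap rule parser on the fast path while the language model handles the non-standard replies the abstract describes; the actual prompts, models, and fallback policy would follow the paper's experimental setup.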

Key words: Language model, Subject knowledge, Multiple choice evaluation

CLC number: TP391