Computer Science ›› 2025, Vol. 52 ›› Issue (10): 201-207. doi: 10.11896/jsjkx.240800148

• Artificial Intelligence •

Subject Knowledge Evaluation Method for Language Models Based on Multiple Choice Questions

XIONG Zhuozhi, GU Zhouhong, FENG Hongwei, XIAO Yanghua   

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  • Received: 2024-08-28 Revised: 2024-10-31 Online: 2025-10-15 Published: 2025-10-14
  • About author: XIONG Zhuozhi, born in 1999, postgraduate. His main research interests include knowledge graph and natural language processing.
    XIAO Yanghua, born in 1980, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.12210D). His main research interests include knowledge graph and natural language processing.

Abstract: Subject knowledge evaluation methods for pre-trained language models based on Multiple Choice Questions (MCQ) offer rapid, quantitative evaluation of model knowledge. However, their reliability is compromised by irrelevant factors such as option order and option length, raising robustness concerns. To address this challenge, an analytical framework for evaluating the subject knowledge of pre-trained language models with MCQ is proposed. This framework formalizes MCQ evaluation into two modules, prompting and parsing, and systematically investigates the impact of various MCQ evaluation methods on evaluation outcomes. The robustness of different prompting and parsing techniques is analyzed through experiments on Chinese and English subject knowledge evaluation datasets. Based on these findings, a rewriting-enhanced parsing method is introduced that employs pre-trained language models to rewrite model responses, overcoming the limitations of traditional rule-based parsing when handling non-standard replies. By integrating rewriting with rule-based parsing, this approach improves both answer extraction accuracy and the robustness of the evaluation process, offering a novel and effective strategy for language model evaluation.
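As a rough illustration of the parsing module described in the abstract, the sketch below combines simple rule-based answer extraction with a rewriting fallback. It is a minimal sketch, not the paper's implementation: the extraction rules, the rewriting prompt, and the `rewrite_model` callable (a hypothetical stand-in for any pre-trained language model API) are assumptions introduced here.

```python
import re

# Minimal sketch of rewriting-enhanced parsing. `rewrite_model` is a
# hypothetical callable (prompt string in, completion string out); the
# paper's actual rules, prompts, and models may differ.

OPTION_LETTER = re.compile(r"\b([A-D])\b")

def rule_based_parse(response: str) -> str | None:
    """Extract a single option letter (A-D) with simple rules."""
    stripped = response.strip().rstrip(".")
    if stripped in {"A", "B", "C", "D"}:           # reply is just a letter
        return stripped
    match = re.search(r"(?:answer is|答案是)\s*([A-D])",
                      response, re.IGNORECASE)     # e.g. "The answer is B"
    if match:
        return match.group(1).upper()
    match = OPTION_LETTER.search(response)         # first standalone letter
    return match.group(1) if match else None

def rewriting_enhanced_parse(response: str, rewrite_model) -> str | None:
    """Rule-based parsing first; on failure, have a language model rewrite
    the non-standard reply into a bare option letter and parse again."""
    answer = rule_based_parse(response)
    if answer is not None:
        return answer
    rewritten = rewrite_model(
        "Rewrite the following answer as a single option letter "
        f"(A, B, C, or D) and nothing else:\n{response}"
    )
    return rule_based_parse(rewritten)
```

A free-form reply such as "我认为应该选第二个选项" contains no option letter, so the rules fail and the rewriting step supplies a parseable answer, which is the failure mode the rewriting-enhanced method targets.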

Key words: Language model, Subject knowledge, Multiple choice evaluation

CLC Number: TP391