Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241000129-7. doi: 10.11896/jsjkx.241000129
梁秉豪, 张传刚, 袁明明
LIANG Binghao, ZHANG Chuangang, YUAN Mingming
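Abstract: With the emergence of ChatGPT, large models have become a new arena of global technological competition and are now widely applied across production and daily life. Many Chinese technology companies have invested in developing and open-sourcing large models. As application scenarios continue to expand, and as the types and number of pre-trained large models available for download or invocation keep growing, user demand for large model evaluation is increasing. No standardized method for evaluating large models has yet been established; the industry mainly relies on leaderboards published by third-party organizations to compare model capabilities side by side. Effective means of evaluating how a large model actually performs in a specific application scenario are still lacking. This paper studies the evaluation of pre-trained large models in vertical industry scenarios, in particular their ability to answer open-ended questions. It proposes a cross-evaluation method and experimentally verifies its reliability and robustness. The experimental results show that the results of the proposed cross-evaluation method are highly consistent with the officially published results, indicating that the method is reliable. The proposed method improves the objectivity and convenience of large model evaluation and helps users quickly compare and select large models in their own customized scenarios.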