Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241000129-7. doi: 10.11896/jsjkx.241000129

• Artificial Intelligence •

Research on Cross-Evaluation Method of Large Model

LIANG Binghao, ZHANG Chuangang, YUAN Mingming

  1. Inspur Communication Information System Co., Ltd., Jinan 250013, China
  • Online: 2025-11-15 Published: 2025-11-10
  • Corresponding author: ZHANG Chuangang (zhangchg@inspur.com)
  • About author: liangbinghao@inspur.com
  • Supported by:
    Taishan Industrial Leading Talent Project (tscx202312006) and Shandong Postdoctoral Innovation Project (SDCX-ZG-202400307).

Abstract: With the emergence of ChatGPT, large models have become a new arena for global technology competition and are now widely used across production and daily life. Many domestic technology companies have invested in large model research, development, and open-source work. As application scenarios continue to expand, the types and number of pre-trained large models available for download or invocation keep growing, and users' demand for large model evaluation is increasing. At present, there is no standardized method for evaluating large models; the industry mainly compares model capabilities through leaderboards provided by third-party institutions. Effective means of measuring how large models actually perform in specific application scenarios are still lacking. This paper studies the evaluation of pre-trained large models in vertical industry scenarios, especially their ability to answer open-ended questions, proposes a cross-evaluation method, and verifies its reliability and robustness through experiments. The experimental results show that the results of the proposed cross-evaluation method are highly consistent with the officially published results, indicating that the method is highly reliable. The method improves the objectivity and convenience of large model evaluation and helps users quickly compare and select large models in personalized scenarios.

Key words: Evaluation method, Cross evaluation, Open-ended question, Candidate large model, Judge large model

CLC Number:

  • TP311