Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241000129-7. DOI: 10.11896/jsjkx.241000129

• Artificial Intelligence •

Research on Cross-Evaluation Method of Large Model

LIANG Binghao, ZHANG Chuangang, YUAN Mingming   

  1. Inspur Communication Information System Co., Ltd., Jinan 250013, China
  • Online: 2025-11-15 Published: 2025-11-10
  • Supported by:
    Taishan Industrial Leading Talent Project (tscx202312006) and Shandong Postdoctoral Innovation Project (SDCX-ZG-202400307).

Abstract: With the emergence of ChatGPT, large models have become a new arena of global technology competition and are now widely used in many aspects of production and daily life. Many domestic technology companies have invested in large-model research, development, and open-source work. As the application scenarios of large models continue to expand, the variety and number of pre-trained large models that can be downloaded or invoked keep growing, and users' demand for large-model evaluation is increasing accordingly. At present there is no standardized method for evaluating large models: the industry mainly compares model capabilities through leaderboards provided by third-party institutions, and effective ways to measure the actual performance of large models in specific application scenarios are still lacking. This paper proposes a cross-evaluation method to assess the application effect of pre-trained large models in vertical industry scenarios, especially their ability to answer open-ended questions, and verifies its reliability and robustness through experiments. The proposed cross-evaluation method shows high consistency with official results, indicating that it is highly reliable. The method effectively improves the objectivity and convenience of large-model evaluation and helps users quickly complete horizontal comparison and selection of large models in personalized scenarios.
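The abstract describes the method only at a high level. The sketch below illustrates one plausible reading of cross evaluation, in which every pre-trained model plays two roles: candidate (answering the open-ended questions) and judge (scoring the other candidates' answers), with a candidate's final score averaged over all judges except itself. The interface `query_model`, the 1-10 scoring prompt, and the exclusion of self-judging are illustrative assumptions, not details taken from the paper.

```python
"""Minimal sketch of cross evaluation for open-ended questions.

Assumptions (not from the paper): `query_model(model, prompt)` stands in
for whatever API invokes a pre-trained model; the 1-10 judging prompt and
the no-self-judging rule are illustrative choices, not the authors' exact
protocol.
"""
from statistics import mean
from typing import Callable, Dict, List

QueryFn = Callable[[str, str], str]  # (model_name, prompt) -> model output


def parse_score(reply: str, lo: float = 1.0, hi: float = 10.0) -> float:
    """Pull the first number out of a judge's reply and clamp it to range."""
    for token in reply.replace(",", " ").split():
        try:
            return min(hi, max(lo, float(token)))
        except ValueError:
            continue
    return lo  # fall back to the minimum if no number is found


def cross_evaluate(models: List[str], questions: List[str],
                   query_model: QueryFn) -> Dict[str, float]:
    """Score each candidate model by letting the other models judge it."""
    # 1. Every candidate answers every open-ended question.
    answers = {m: [query_model(m, q) for q in questions] for m in models}

    # 2. Every model judges every other model's answers (no self-judging).
    scores: Dict[str, List[float]] = {m: [] for m in models}
    for judge in models:
        for candidate in models:
            if judge == candidate:
                continue  # a model never grades its own answers
            for q, a in zip(questions, answers[candidate]):
                prompt = ("Rate the following answer from 1 (poor) to 10 "
                          f"(excellent).\nQuestion: {q}\nAnswer: {a}\nScore:")
                scores[candidate].append(
                    parse_score(query_model(judge, prompt)))

    # 3. A candidate's final score averages over all judges and questions.
    return {m: mean(s) for m, s in scores.items()}
```

Under this reading, the reliability claim in the abstract could be checked by computing a rank correlation (e.g., Spearman's ρ) between the ranking produced by `cross_evaluate` and an official leaderboard ranking of the same models.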

Key words: Evaluation method, Cross evaluation, Open-ended question, Candidate large model, Judge large model

CLC Number: TP311
[1] CHANG Y P, WANG X, WANG J D, et al. A Survey on Evaluation of Large Language Models [J]. arXiv:2307.03109, 2023.
[2] WANG A, SINGH A, MICHAEL J, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding [J]. arXiv:1804.07461, 2018.
[3] WANG A, PRUKSACHATKUN Y, NANGIA N, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems [J]. arXiv:1905.00537, 2019.
[4] ZHONG W J, CUI R X, GUO Y D, et al. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [J]. arXiv:2304.06364, 2023.
[5] HENDRYCKS D, BURNS C, BASART S, et al. Measuring Massive Multitask Language Understanding [J]. arXiv:2009.03300, 2021.
[6] SRIVASTAVA A, RASTOGI A, RAO A, et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models [J]. arXiv:2206.04615, 2023.
[7] HUANG Y Z, BAI Y Z, ZHU Z H, et al. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models [J]. arXiv:2305.08322, 2023.
[8] ZENG H. Measuring Massive Multitask Chinese Understanding [J]. arXiv:2304.12986, 2023.
[9] SHAH R S, CHAWLA K, EIDNANI D, et al. When FLUE Meets FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain [J]. arXiv:2211.00083, 2022.
[10] ZHANG L W, CAI W G, LIU Z W, et al. FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models [J]. arXiv:2308.09975, 2023.
[11] LEI Y, LI J T, CHENG D W, et al. CFBenchmark: Chinese Financial Assistant Benchmark for Large Language Model [J]. arXiv:2311.05812v2, 2024.
[12] FEI Z W, SHEN X Y, ZHU D W, et al. LawBench: Benchmarking Legal Knowledge of Large Language Models [J]. arXiv:2309.16289, 2023.
[13] LIU M X, HU W G, DING J R, et al. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [J]. arXiv:2407.10990, 2024.
[14] BAI J Z, BAI S, CHU Y F, et al. Qwen Technical Report [J]. arXiv:2309.16609, 2023.
[15] YANG A Y, XIAO B, WANG B N, et al. Baichuan 2: Open Large-scale Language Models [J]. arXiv:2309.10305, 2023.