Computer Science ›› 2025, Vol. 52 ›› Issue (3): 239-247. doi: 10.11896/jsjkx.240900123

• Artificial Intelligence •

Study on Evaluation Framework of Large Language Model’s Financial Scenario Capability

CHENG Dawei1,2,3, WU Jiaxuan1, LI Jiangtong1, DING Zhijun1,2,3, JIANG Changjun1,2,3   

  1. College of Computer Science and Technology, Tongji University, Shanghai 201804, China
  2. Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
  3. National Collaborative Innovation Center for Internet Financial Security, Shanghai 201804, China
  • Received: 2024-09-20  Revised: 2024-11-02  Online: 2025-03-15  Published: 2025-03-07
  • About author: CHENG Dawei, born in 1987, Ph.D, associate professor, Ph.D supervisor, is a senior member of CCF (No.S5746M). His main research interests include graph learning, big data computing, data mining and machine learning.
    JIANG Changjun, born in 1962, Ph.D, professor, is an academician of the Chinese Academy of Engineering and an IET fellow. His main research interests include computer science and security of network finance.
  • Supported by:
    National Key R&D Program of China (2022YFB4501704), National Natural Science Foundation of China (62102287, 62472317) and Shanghai Science and Technology Innovation Action Plan Project (24692118300, 22YS1400600).

Abstract: With the rapid development of large language models (LLMs), their application in the financial sector has become a driving force for industry transformation. Establishing a standardized and systematic evaluation framework for financial capabilities is a crucial way to assess LLMs' abilities in financial scenarios. However, current evaluation methods have limitations, such as weak generalization of evaluation datasets and narrow coverage of task scenarios. To address these issues, this paper proposes a financial large language model benchmark, named CFBenchmark, which consists of four core assessment modules: financial natural language processing, financial scenario computation, financial analysis and interpretation, and financial compliance and security. High-quality tasks and systematic evaluation metrics are designed for the multi-task scenarios within each module, providing a standardized and systematic approach to assessing large models in the financial domain. Experimental results indicate that the performance of LLMs in financial scenarios is closely related to their parameters, architecture, and training process. As LLMs are applied more widely in the financial sector, the financial LLM benchmark will need to incorporate more real-world application designs and higher-quality evaluation data to help enhance the generalization ability of LLMs across diverse financial scenarios.
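To make the modular structure described above concrete, the following is a minimal sketch of how an evaluation harness organized around the four assessment modules might be laid out. The module names follow the abstract; the sample tasks, the exact_match metric, and the model_fn interface are illustrative assumptions rather than the authors' released implementation.

```python
# Hypothetical sketch of a CFBenchmark-style evaluation harness.
# Module names mirror the paper's four assessment modules; task data,
# metric choices, and the model_fn interface are illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    prompt: str
    reference: str
    metric: Callable[[str, str], float]

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the normalized answers agree, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

# The four core modules from the paper, each holding its own task list.
MODULES: Dict[str, List[Task]] = {
    "financial_nlp": [
        Task("Classify the sentiment of: 'Q3 revenue beat estimates.'",
             "positive", exact_match),
    ],
    "financial_scenario_computation": [
        Task("A bond pays 5% annually on a 1000 face value. Annual coupon?",
             "50", exact_match),
    ],
    "financial_analysis_interpretation": [],
    "financial_compliance_security": [],
}

def evaluate(model_fn: Callable[[str], str]) -> Dict[str, float]:
    """Run every task in every non-empty module; average scores per module."""
    report = {}
    for module, tasks in MODULES.items():
        if not tasks:
            continue
        scores = [t.metric(model_fn(t.prompt), t.reference) for t in tasks]
        report[module] = sum(scores) / len(scores)
    return report

if __name__ == "__main__":
    # Stand-in "model" that always answers "positive"; in practice this
    # would wrap a real LLM call (e.g., an API client).
    print(evaluate(lambda prompt: "positive"))
```

Keeping each module as an independent task list reflects the benchmark's stated design goal: per-module scores expose where a model is strong (e.g., financial NLP) and where it is weak (e.g., scenario computation), rather than collapsing financial capability into a single number.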

Key words: Large language model benchmark, Financial large language model, Financial scenario computation, Financial analysis and interpretation, Financial compliance and security

CLC Number: TP391