Computer Science ›› 2025, Vol. 52 ›› Issue (3): 239-247. DOI: 10.11896/jsjkx.240900123

• Artificial Intelligence •

Study on Evaluation Framework of Large Language Model’s Financial Scenario Capability

CHENG Dawei1,2,3, WU Jiaxuan1, LI Jiangtong1, DING Zhijun1,2,3, JIANG Changjun1,2,3   

  1. College of Computer Science and Technology, Tongji University, Shanghai 201804, China
  2. Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
  3. National Collaborative Innovation Center for Internet Financial Security, Shanghai 201804, China
  • Received: 2024-09-20  Revised: 2024-11-02  Online: 2025-03-15  Published: 2025-03-07
  • About authors: CHENG Dawei, born in 1987, Ph.D., associate professor, Ph.D. supervisor, is a senior member of CCF (No. S5746M). His main research interests include graph learning, big data computing, data mining, and machine learning.
    JIANG Changjun, born in 1962, Ph.D., professor, is an academician of the Chinese Academy of Engineering and an IET Fellow. His main research interests include computer science and the security of network finance.
  • Supported by:
    National Key R&D Program of China (2022YFB4501704), National Natural Science Foundation of China (62102287, 62472317), and Shanghai Science and Technology Innovation Action Plan Project (24692118300, 22YS1400600).

Abstract: With the rapid development of large language models (LLMs), their application in the financial sector has become a driving force for industry transformation. Establishing a standardized and systematic evaluation framework for financial capabilities is crucial for assessing large language models' abilities in financial scenarios. However, current evaluation methods have limitations, such as the weak generalization of evaluation datasets and the narrow coverage of task scenarios. To address these issues, this paper proposes a financial large language model benchmark, named CFBenchmark, which consists of four core assessment modules: financial natural language processing, financial scenario computation, financial analysis and interpretation, and financial compliance and security. High-quality tasks and systematic evaluation metrics are designed for the multi-task scenarios within each module, providing a standardized and systematic approach to assessing large models in the financial domain. Experimental results indicate that the performance of large language models in financial scenarios is closely related to their parameters, architecture, and training process. As LLMs become more widely applied in the financial sector, the financial LLM benchmark will need to incorporate more real-world application designs and higher-quality evaluation data to help enhance the generalization ability of LLMs across diverse financial scenarios.
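To make the module-based evaluation concrete, the sketch below shows how per-task scores could be aggregated within each of the four assessment modules and then macro-averaged into an overall benchmark score. This is an illustrative assumption, not CFBenchmark's actual implementation: the module names, the per-task metric, and the equal-weight averaging scheme are all hypothetical choices for demonstration.

```python
# Illustrative sketch: aggregating per-task scores in a four-module
# financial LLM benchmark. All names and weights are assumptions.

MODULES = [
    "financial_nlp",            # financial natural language processing
    "scenario_computation",     # financial scenario computation
    "analysis_interpretation",  # financial analysis and interpretation
    "compliance_security",      # financial compliance and security
]

def module_score(task_scores):
    """Average a metric (e.g., accuracy or F1) over a module's tasks."""
    return sum(task_scores) / len(task_scores)

def benchmark_score(results):
    """Macro-average across modules so each module weighs equally,
    regardless of how many tasks it contains."""
    per_module = {m: module_score(results[m]) for m in MODULES}
    overall = sum(per_module.values()) / len(per_module)
    return per_module, overall

# Hypothetical per-task scores for one model under evaluation.
results = {
    "financial_nlp": [0.82, 0.76, 0.91],
    "scenario_computation": [0.55, 0.61],
    "analysis_interpretation": [0.70, 0.68, 0.74, 0.72],
    "compliance_security": [0.88, 0.93],
}
per_module, overall = benchmark_score(results)
```

Macro-averaging at the module level keeps a module with many tasks (e.g., financial NLP) from dominating the overall score, which matches the benchmark's goal of covering each capability area evenly.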

Key words: Large language model benchmark, Financial large language model, Financial scenario computation, Financial analysis and interpretation, Financial compliance and security

CLC Number: TP391