Computer Science ›› 2025, Vol. 52 ›› Issue (3): 239-247. doi: 10.11896/jsjkx.240900123

• Artificial Intelligence •

Study on Evaluation Framework of Large Language Model’s Financial Scenario Capability

CHENG Dawei1,2,3, WU Jiaxuan1, LI Jiangtong1, DING Zhijun1,2,3, JIANG Changjun1,2,3   

  1. College of Computer Science and Technology, Tongji University, Shanghai 201804, China
  2. Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
  3. National Collaborative Innovation Center for Internet Financial Security, Shanghai 201804, China
  • Received: 2024-09-20  Revised: 2024-11-02  Online: 2025-03-15  Published: 2025-03-07
  • About author: CHENG Dawei, born in 1987, Ph.D, associate professor, Ph.D supervisor, is a senior member of CCF (No.S5746M). His main research interests include graph learning, big data computing, data mining and machine learning.
    JIANG Changjun, born in 1962, Ph.D, professor, is an academician of the Chinese Academy of Engineering and an IET fellow. His main research interests include computer science and security of network finance.
  • Supported by:
    National Key R&D Program of China (2022YFB4501704), National Natural Science Foundation of China (62102287, 62472317) and Shanghai Science and Technology Innovation Action Plan Project (24692118300, 22YS1400600).

Abstract: With the rapid development of large language models (LLMs), their application in the financial sector has become a driving force for industry transformation. Establishing a standardized and systematic evaluation framework for financial capabilities is a crucial way to assess LLMs' abilities in financial scenarios. However, current evaluation methods have limitations, such as weak generalization of evaluation datasets and narrow coverage of task scenarios. To address these issues, this paper proposes a financial large language model benchmark, named CFBenchmark, which consists of four core assessment modules: financial natural language processing, financial scenario computation, financial analysis and interpretation, and financial compliance and security. High-quality tasks and systematic evaluation metrics are designed for the multi-task scenarios within each module, providing a standardized and systematic approach to assessing large models in the financial domain. Experimental results indicate that the performance of LLMs in financial scenarios is closely related to their parameters, architecture, and training process. As LLMs are applied more widely in the financial sector, the financial LLM benchmark will need to incorporate more real-world application designs and higher-quality evaluation data to help enhance the generalization ability of LLMs across diverse financial scenarios.
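To make the modular structure described above concrete, the following is a minimal sketch of how an evaluation harness organized around the four assessment modules might be laid out. The module names follow the abstract; the sample tasks, the exact_match metric, and the model_fn interface are illustrative assumptions rather than the authors' released implementation.

```python
# Hypothetical sketch of a CFBenchmark-style evaluation harness.
# Module names mirror the paper's four assessment modules; task data,
# metric choices, and the model_fn interface are illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    prompt: str
    reference: str
    metric: Callable[[str, str], float]

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the normalized answers agree, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

# The four core modules from the paper, each holding its own task list.
MODULES: Dict[str, List[Task]] = {
    "financial_nlp": [
        Task("Classify the sentiment of: 'Q3 revenue beat estimates.'",
             "positive", exact_match),
    ],
    "financial_scenario_computation": [
        Task("A bond pays 5% annually on a 1000 face value. Annual coupon?",
             "50", exact_match),
    ],
    "financial_analysis_interpretation": [],
    "financial_compliance_security": [],
}

def evaluate(model_fn: Callable[[str], str]) -> Dict[str, float]:
    """Run every task in every non-empty module; average scores per module."""
    report = {}
    for module, tasks in MODULES.items():
        if not tasks:
            continue
        scores = [t.metric(model_fn(t.prompt), t.reference) for t in tasks]
        report[module] = sum(scores) / len(scores)
    return report

if __name__ == "__main__":
    # Stand-in "model" that always answers "positive"; in practice this
    # would wrap a real LLM call (e.g., an API client).
    print(evaluate(lambda prompt: "positive"))
```

Keeping each module as an independent task list reflects the benchmark's stated design goal: per-module scores expose where a model is strong (e.g., financial NLP) and where it is weak (e.g., scenario computation), rather than collapsing financial capability into a single number.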

Key words: Large language model benchmark, Financial large language model, Financial scenario computation, Financial analysis and interpretation, Financial compliance and security

CLC Number: TP391