Computer Science ›› 2025, Vol. 52 ›› Issue (3): 239-247. doi: 10.11896/jsjkx.240900123

• Artificial Intelligence •

Study on Evaluation Framework of Large Language Model's Financial Scenario Capability

CHENG Dawei1,2,3, WU Jiaxuan1, LI Jiangtong1, DING Zhijun1,2,3, JIANG Changjun1,2,3

  1 College of Computer Science and Technology, Tongji University, Shanghai 201804, China
    2 Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
    3 National Collaborative Innovation Center for Internet Financial Security, Shanghai 201804, China
  • Received: 2024-09-20 Revised: 2024-11-02 Online: 2025-03-15 Published: 2025-03-07
  • Corresponding author: JIANG Changjun (cjjiang@tongji.edu.cn)
  • About author: (dcheng@tongji.edu.cn)
  • Supported by:
    National Key R&D Program of China (2022YFB4501704), National Natural Science Foundation of China (62102287, 62472317) and Shanghai Science and Technology Innovation Action Plan (24692118300, 22YS1400600)

Study on Evaluation Framework of Large Language Model’s Financial Scenario Capability

CHENG Dawei1,2,3, WU Jiaxuan1, LI Jiangtong1, DING Zhijun1,2,3, JIANG Changjun1,2,3   

  1 College of Computer Science and Technology, Tongji University, Shanghai 201804, China
    2 Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
    3 National Collaborative Innovation Center for Internet Financial Security, Shanghai 201804, China
  • Received: 2024-09-20 Revised: 2024-11-02 Online: 2025-03-15 Published: 2025-03-07
  • About author: CHENG Dawei, born in 1987, Ph.D, associate professor, Ph.D supervisor, is a senior member of CCF (No. S5746M). His main research interests include graph learning, big data computing, data mining and machine learning.
    JIANG Changjun, born in 1962, Ph.D, professor, is an academician of the Chinese Academy of Engineering, China and an IET fellow. His main research interests include computer science and security of network finance.
  • Supported by:
    National Key R&D Program of China (2022YFB4501704), National Natural Science Foundation of China (62102287, 62472317) and Shanghai Science and Technology Innovation Action Plan Project (24692118300, 22YS1400600).

Abstract: With the rapid development of large language model technology, its application in the financial domain has become an important force driving industry transformation. Building a standardized, systematic evaluation framework for financial capabilities is a key way to measure the abilities of large language models in financial scenarios, but existing evaluation methods suffer from shortcomings such as weak generalization of evaluation datasets and narrow coverage of task scenarios. This paper therefore proposes CFBenchmark, an evaluation framework for the financial capabilities of large language models, composed of four core assessment modules: financial natural language processing, financial scenario computation, financial analysis and interpretation, and financial compliance and security. Based on multi-task scenario design within each module and systematic evaluation metrics, the framework provides a standardized, systematic approach to assessing the capabilities of large language models in the financial domain. Experimental results show that the performance of large language models in financial scenarios is closely tied to model parameters, architecture, and training process, and that these models still have considerable room for improvement in financial compliance and security. As large language models are applied ever more widely in finance, the financial capability evaluation framework will need to incorporate task designs for more real-world scenarios and the collection of high-quality evaluation data, so as to improve the generalization ability of large language models across diverse financial scenarios.

Keywords: Large language model evaluation, Financial large language model, Financial scenario computation, Financial analysis and interpretation, Financial compliance and security

Abstract: With the rapid development of large language models (LLMs), their application in the financial sector has become a driving force for industry transformation. Establishing a standardized and systematic evaluation framework for financial capabilities is a crucial way to assess large language models' abilities in financial scenarios. However, current evaluation methods have limitations, such as weak generalization of evaluation datasets and narrow coverage of task scenarios. To address these issues, this paper proposes a financial large language model benchmark, named CFBenchmark, which consists of four core assessment modules: financial natural language processing, financial scenario computation, financial analysis and interpretation, and financial compliance and security. High-quality tasks and systematic evaluation metrics are designed based on multi-task scenarios within each module, providing a standardized and systematic approach to assessing large models in the financial domain. Experimental results indicate that the performance of large language models in financial scenarios is closely related to their parameters, architecture, and training process. As the application of LLMs in the financial sector becomes more widespread in the future, the financial LLM benchmark will need to include more real-world application designs and high-quality evaluation data collection to help enhance the generalization ability of LLMs across diverse financial scenarios.
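The module-based scoring described in the abstract can be illustrated with a minimal sketch. This is a hypothetical aggregation scheme only: the four module names come from the paper, but the task names, the macro-averaging within modules, and the equal weighting across modules are all illustrative assumptions, not CFBenchmark's published protocol.

```python
# Hypothetical sketch of aggregating scores for a four-module financial
# benchmark. Task names, weights, and averaging are illustrative assumptions.
from statistics import mean

MODULES = [
    "financial natural language processing",
    "financial scenario computation",
    "financial analysis and interpretation",
    "financial compliance and security",
]

def module_score(task_scores: dict) -> float:
    """Macro-average the per-task scores within one module."""
    return mean(task_scores.values())

def overall_score(per_module: dict) -> float:
    """Average the module scores into a single benchmark figure."""
    return mean(module_score(tasks) for tasks in per_module.values())

# Toy example: two tasks per module, each scored in [0, 1].
results = {m: {"task_a": 0.8, "task_b": 0.6} for m in MODULES}
print(round(overall_score(results), 3))  # 0.7
```

Macro-averaging keeps a weak module (e.g. compliance and security, which the paper finds has the most room for improvement) visible in the final figure rather than letting strong modules mask it.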

Key words: Large language model benchmark, Financial large language model, Financial scenario computation, Financial analysis and interpretation, Financial compliance and security

CLC number: TP391