Computer Science ›› 2026, Vol. 53 ›› Issue (3): 23-32. doi: 10.11896/jsjkx.250900173
王智彬1, 李世鹏1,2, 周宇航1, 李雪2, 张中辉1, 蒋智威1, 顾荣1, 田臣1, 陈贵海1, 仲盛1
WANG Zhibin1, LI Shipeng1,2, ZHOU Yuhang1, LI Xue2, ZHANG Zhonghui1, JIANG Zhiwei1, GU Rong1, TIAN Chen1, CHEN Guihai1, ZHONG Sheng1
Abstract: In large language model (LLM) serving systems, user experience is a key consideration. Service-level objectives (SLOs) and system-level metrics are the two principal performance measures: the former focuses on the experience of an individual request, while the latter focuses on overall system performance. However, existing metrics suffer from two counter-intuitive problems: 1) deliberately delaying the delivery of some tokens can improve SLO metrics; 2) proactively dropping requests that cannot meet their SLOs can improve system-level metrics. To address these problems, this paper re-examines SLOs and system-level metrics in LLM serving and proposes a new SLO that aligns more closely with user experience. Building on this SLO, it proposes a unified metric framework named "smoothed goodput", which integrates SLOs and system-level metrics to capture the essence of user experience in LLM serving. Using this unified framework, the performance of different LLM serving systems is re-evaluated under a variety of workloads. The results show that the proposed framework provides a more comprehensive view of token delivery and request handling, and effectively captures the sweet spot between user experience and system performance under different serving strategies.
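To make the idea concrete, the following is a minimal, hypothetical sketch of a smoothed-goodput-style metric; the deadline model (a TTFT budget plus a fixed per-token budget) and all names (token_deadlines, smoothed_attainment, smoothed_goodput) are illustrative assumptions for exposition, not the paper's exact definitions.

from typing import List, Tuple

def token_deadlines(arrival: float, n_tokens: int,
                    ttft_budget: float = 1.0,
                    tbt_budget: float = 0.05) -> List[float]:
    """Deadline for each token: a TTFT budget for the first token,
    then a fixed per-token budget. (Illustrative deadline model.)"""
    return [arrival + ttft_budget + i * tbt_budget for i in range(n_tokens)]

def smoothed_attainment(arrival: float, deliveries: List[float],
                        ttft_budget: float = 1.0,
                        tbt_budget: float = 0.05) -> float:
    """Fraction of a request's tokens delivered by their deadlines.

    Partial credit instead of a binary pass/fail: a request that
    narrowly misses its SLO still earns proportional credit, and
    delaying a token can only lose credit here, never gain it."""
    deadlines = token_deadlines(arrival, len(deliveries),
                                ttft_budget, tbt_budget)
    on_time = sum(1 for d, dl in zip(deliveries, deadlines) if d <= dl)
    return on_time / len(deliveries) if deliveries else 0.0

def smoothed_goodput(requests: List[Tuple[float, List[float]]],
                     window: float) -> float:
    """System-level view: SLO-credit-weighted tokens per second,
    where each request contributes attainment * token count."""
    credit = sum(smoothed_attainment(arr, dlv) * len(dlv)
                 for arr, dlv in requests)
    return credit / window

# Two requests observed over a 10-second window.
reqs = [
    (0.0, [0.8, 0.84, 0.90, 0.95]),        # every token on time -> full credit
    (1.0, [1.9, 2.10, 2.20, 2.40, 2.60]),  # only first token on time -> 1/5 credit
]
print(f"smoothed goodput: {smoothed_goodput(reqs, window=10.0):.2f} tokens/s")
# -> 0.50 tokens/s (4*1.0 + 5*0.2 = 5 credited tokens over 10 s)

Under this kind of partial-credit design, both counter-intuitive incentives from the abstract disappear: holding back tokens can only forfeit credit, and dropping an at-risk request discards its remaining credit rather than improving the average.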