Computer Science, 2026, Vol. 53, Issue (3): 23-32. doi: 10.11896/jsjkx.250900173

• Intelligent Information System Based on AGI Technology •

Optimization of Service Level Objectives and System Level Metrics in Large Language Model Serving Systems

WANG Zhibin1, LI Shipeng1,2, ZHOU Yuhang1, LI Xue2, ZHANG Zhonghui1, JIANG Zhiwei1, GU Rong1, TIAN Chen1, CHEN Guihai1, ZHONG Sheng1   

  1. State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210023, China
  2. Alibaba Group, Hangzhou 310000, China
  • Received: 2025-09-29  Revised: 2025-11-29  Published: 2026-03-12
  • About author: WANG Zhibin, born in 1996, Ph.D., is a member of CCF (No. 62267M). His main research interests include graph computing, graph mining and learning, and machine learning systems.
    GU Rong, born in 1988, Ph.D., associate professor, Ph.D. supervisor, is a member of CCF (No. 32327S). His main research interests include cloud and big data computing systems, distributed AI training and inference systems, and intelligent data management systems.
  • Supported by: Nanjing “U35” Talent Cultivation Program (U(2024)001), the National Natural Science Foundation of China (61872176, 62272215, 62325205, 62172204), the Leading-edge Technology Program of Jiangsu Natural Science Foundation (BK20202001), the National Key R&D Program of China (2020YFB1005900), and the Key Program of the Natural Science Foundation of Jiangsu Province (BK20243053).

Abstract: In Large Language Model (LLM) serving systems, user experience is a critical consideration. Service-Level Objectives (SLOs) and System-Level Metrics (SLMs) are two key performance measures: the former focuses on the experience of individual requests, while the latter reflects the overall performance of the system. However, existing metrics exhibit two counterintuitive issues: 1) deliberately delaying the delivery of some tokens can improve SLOs; 2) actively abandoning requests that cannot meet their SLOs can improve SLMs. To address these issues, the definitions of SLOs and SLMs in LLM serving are revisited, and a new type of SLO is proposed that aligns more closely with actual user experience. Building on this SLO, a unified metric framework called smooth goodput is developed, which integrates SLOs and SLMs to capture the nature of user experience in LLM serving. With this framework, the performance of different LLM serving systems is re-evaluated under multiple workloads. Evaluation results show that the proposed metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance under different serving strategies.
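To make the second issue concrete, the Python sketch below contrasts a conventional "hard" goodput, which counts a request only if every one of its tokens meets the SLO, with a smoothed per-token variant. The TBT budget, the attainment rule, and the function names are assumptions introduced here for exposition; they are not the paper's actual smooth goodput definition.

# Illustrative sketch only: the TBT budget and per-token attainment rule below
# are assumptions for exposition, not the paper's actual smooth goodput metric.
from typing import List

TBT_SLO = 0.05  # assumed time-between-tokens budget per token, in seconds

def hard_goodput(requests: List[List[float]]) -> int:
    """Conventional goodput: a request counts only if every token meets the SLO."""
    return sum(all(tbt <= TBT_SLO for tbt in req) for req in requests)

def smooth_goodput(requests: List[List[float]]) -> float:
    """Smoothed variant: credit each timely token, so partially served requests
    still contribute and abandoning a near-miss request is never rewarded."""
    return sum(sum(tbt <= TBT_SLO for tbt in req) / len(req)
               for req in requests if req)

# Two requests: one fully on time, one with a single late token at the end.
reqs = [[0.03] * 10, [0.03] * 9 + [0.20]]
print(hard_goodput(reqs))    # 1   -> dropping request 2 would not change this
print(smooth_goodput(reqs))  # 1.9 -> its nine timely tokens still count

Under the hard metric, abandoning the second request costs nothing, which is exactly the incentive the abstract describes; the smoothed variant still credits its nine timely tokens, so abandonment always lowers the score.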

Key words: Large language model, Inference serving system, Service level objectives, Scheduling

CLC Number: TP319