Computer Science ›› 2026, Vol. 53 ›› Issue (3): 23-32. doi: 10.11896/jsjkx.250900173
王智彬1, 李世鹏1,2, 周宇航1, 李雪2, 张中辉1, 蒋智威1, 顾荣1, 田臣1, 陈贵海1, 仲盛1
WANG Zhibin1, LI Shipeng1,2, ZHOU Yuhang1, LI Xue2, ZHANG Zhonghui1, JIANG Zhiwei1, GU Rong1, TIAN Chen1, CHEN Guihai1, ZHONG Sheng1
Abstract: In large language model (LLM) serving systems, user experience is a key consideration. Service-level objectives (SLOs) and system-level metrics are the two principal performance measures: the former focuses on the experience of an individual request, while the latter focuses on overall system performance. However, existing metrics suffer from two counter-intuitive problems: 1) deliberately delaying the delivery of some tokens can improve SLO metrics; 2) proactively dropping requests that cannot meet their SLOs can improve system-level metrics. To address these problems, this paper re-examines SLOs and system-level metrics in LLM serving and proposes a new SLO that aligns more closely with user experience. Building on this SLO, it proposes a unified metric framework named "smoothed goodput", which integrates SLOs and system-level metrics to capture the essence of user experience in LLM serving. Using this unified framework, the performance of different LLM serving systems is re-evaluated under a variety of workloads. The results show that the proposed framework provides a more comprehensive view of token delivery and request handling, and effectively captures the sweet spot between user experience and system performance under different serving strategies.
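To make the idea concrete, the following is a minimal, hypothetical sketch of a smoothed-goodput-style metric; the deadline model (a TTFT budget plus a fixed per-token budget) and all names (token_deadlines, smoothed_attainment, smoothed_goodput) are illustrative assumptions for exposition, not the paper's exact definitions.

from typing import List, Tuple

def token_deadlines(arrival: float, n_tokens: int,
                    ttft_budget: float = 1.0,
                    tbt_budget: float = 0.05) -> List[float]:
    """Deadline for each token: a TTFT budget for the first token,
    then a fixed per-token budget. (Illustrative deadline model.)"""
    return [arrival + ttft_budget + i * tbt_budget for i in range(n_tokens)]

def smoothed_attainment(arrival: float, deliveries: List[float],
                        ttft_budget: float = 1.0,
                        tbt_budget: float = 0.05) -> float:
    """Fraction of a request's tokens delivered by their deadlines.

    Partial credit instead of a binary pass/fail: a request that
    narrowly misses its SLO still earns proportional credit, and
    delaying a token can only lose credit here, never gain it."""
    deadlines = token_deadlines(arrival, len(deliveries),
                                ttft_budget, tbt_budget)
    on_time = sum(1 for d, dl in zip(deliveries, deadlines) if d <= dl)
    return on_time / len(deliveries) if deliveries else 0.0

def smoothed_goodput(requests: List[Tuple[float, List[float]]],
                     window: float) -> float:
    """System-level view: SLO-credit-weighted tokens per second,
    where each request contributes attainment * token count."""
    credit = sum(smoothed_attainment(arr, dlv) * len(dlv)
                 for arr, dlv in requests)
    return credit / window

# Two requests observed over a 10-second window.
reqs = [
    (0.0, [0.8, 0.84, 0.90, 0.95]),        # every token on time -> full credit
    (1.0, [1.9, 2.10, 2.20, 2.40, 2.60]),  # only first token on time -> 1/5 credit
]
print(f"smoothed goodput: {smoothed_goodput(reqs, window=10.0):.2f} tokens/s")
# -> 0.50 tokens/s (4*1.0 + 5*0.2 = 5 credited tokens over 10 s)

Under this kind of partial-credit design, both counter-intuitive incentives from the abstract disappear: holding back tokens can only forfeit credit, and dropping an at-risk request discards its remaining credit rather than improving the average.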