Computer Science ›› 2026, Vol. 53 ›› Issue (1): 12-28. doi: 10.11896/jsjkx.250300030

• Research and Applications of Large Language Model Technology •

Efficient Inference Techniques of Large Models in Real-world Applications: A Comprehensive Survey

LIU Lilong1, LIU Guoming2, QI Baoyuan3, DENG Xueshan4, XUE Dizhan4, QIAN Shengsheng4

  1 Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China;
  2 Group Technical Committee, Xiaomi Automobile Technology Co., Ltd., Beijing 100085, China;
  3 Group Technical Committee, Beijing Xiaomi Pinecone Electronics Co., Ltd., Beijing 100085, China;
  4 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2025-03-06  Revised: 2025-07-02  Online: 2026-01-08
  • Corresponding author: QIAN Shengsheng (shengsheng.qian@nlpr.ia.ac.cn)
  • About author: LIU Lilong, born in 2000, postgraduate (liulilong0401@163.com). His main research interests include artificial intelligence and natural language processing.
    QIAN Shengsheng, born in 1991, Ph.D., professor, is a member of CCF (No. 77702M). His main research interests include data mining and multimedia content analysis.
  • Supported by:
    National Key Research and Development Program of China (2023YFC3310700) and National Natural Science Foundation of China (62276257).

Abstract: In recent years, large language model (LLM) technologies have developed rapidly, and their applications across various industries are growing vigorously. From natural language processing to intelligent recommendation, and from information retrieval to automated writing, LLMs are becoming indispensable tools in many fields. However, as application scenarios diversify and demands increase, the inference efficiency of LLMs has become an increasingly prominent problem. In practical applications, rapid and accurate inference is crucial for responding to user requests, handling large-scale data, and making real-time decisions. To address this challenge, academia has undertaken extensive research and exploration to improve the inference efficiency of LLMs. This paper comprehensively surveys the literature on efficient LLM inference in real-world application scenarios. First, it introduces the principles of LLM inference and analyzes how inference efficiency can be improved in practical applications. Then, it introduces a taxonomy tailored to real-world applications, which consists of three main levels: algorithm optimization, parameter optimization, and system optimization, and it summarizes and categorizes the related work on large models. Finally, it discusses potential future research directions.
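As background for the inference process the survey builds on, the following is a minimal, self-contained Python sketch (illustrative only, not code from the surveyed systems; it assumes a toy single-head attention layer with random NumPy weights and greedy decoding) of autoregressive prefill and decode with a key-value (KV) cache. The per-step attention cost and the cache that grows with context length are what the algorithm-, parameter-, and system-level optimizations categorized in this survey aim to reduce.

import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000

# Toy weights standing in for one attention layer plus an output head
# (hypothetical stand-ins; a real LLM has many layers, heads, and an MLP per block).
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))
W_out = rng.standard_normal((d_model, vocab)) * 0.02
embed = rng.standard_normal((vocab, d_model)) * 0.02

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def step(token_id, k_cache, v_cache):
    # One step: project the new token, append its K/V to the cache,
    # attend over all cached positions, and return next-token logits.
    x = embed[token_id]                            # (d_model,)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)    # (t, d_model): grow by one row per token
    attn = softmax(q @ K.T / np.sqrt(d_model))     # attention cost grows with context length
    return (attn @ V) @ W_out                      # (vocab,) logits for the next token

def generate(prompt_ids, max_new_tokens=8):
    k_cache, v_cache = [], []
    # Prefill: populate the cache from the prompt (real systems do this in one batched pass).
    for t in prompt_ids:
        logits = step(t, k_cache, v_cache)
    out = list(prompt_ids)
    # Decode: one token per step, reusing cached K/V instead of recomputing the whole prefix.
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))           # greedy decoding for simplicity
        out.append(next_id)
        logits = step(next_id, k_cache, v_cache)
    return out

print(generate([1, 2, 3, 4]))

Running the sketch prints a short greedily decoded continuation of the dummy prompt. In real serving systems the prefill pass is batched, and the growing KV cache is the main memory consumer targeted by cache management and compression techniques discussed in the survey.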

Key words: Large language models, Efficient inference, Practical application scenarios, Algorithm optimization, Parameter optimization, System optimization

CLC Number: TP391