Computer Science ›› 2026, Vol. 53 ›› Issue (1): 12-28. doi: 10.11896/jsjkx.250300030

• Research and Application of Large Language Model Technology •

Efficient Inference Techniques of Large Models in Real-world Applications:A Comprehensive Survey

LIU Lilong1, LIU Guoming2, QI Baoyuan3, DENG Xueshan4, XUE Dizhan4, QIAN Shengsheng4   

1 Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China;
    2 Group Technical Committee, Xiaomi Automobile Technology Co., Ltd., Beijing 100085, China;
    3 Group Technical Committee, Beijing Xiaomi Pinecone Electronics Co., Ltd., Beijing 100085, China;
    4 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2025-03-06 Revised:2025-07-02 Published:2026-01-08
  • About author:LIU Lilong,born in 2000,postgraduate.His main research interests include artificial intelligence and natural language processing.
    QIAN Shengsheng,born in 1991,Ph.D,professor,is a member of CCF(No.77702M).His main research interests include data mining and multimedia content analysis.
  • Supported by:
    National Key Research and Development Program of China(2023YFC3310700) and National Natural Science Foundation of China(62276257).

Abstract: In recent years,the technologies of large language models(LLMs) have developed rapidly,and their applications across various industries have experienced vigorous growth.From natural language processing to intelligent recommendation,and from information retrieval to automated writing,LLMs are becoming indispensable tools in many fields.However,with the diversification of application scenarios and the increase in demands,the efficiency of LLM inference has become an increasingly prominent concern.In practical applications,rapid and accurate inference capabilities are crucial for responding to user queries,handling large-scale data,and making real-time decisions.To address this challenge,academia has undertaken extensive research and exploration to enhance the inference efficiency of LLMs.This paper comprehensively surveys the literature on efficient LLM inference in practical application scenarios.Firstly,it introduces the principles of LLMs and analyzes how to improve LLM inference efficiency in practical application scenarios.Secondly,it proposes a taxonomy tailored for real-world applications,which consists of three main levels:algorithm optimization,parameter optimization,and system optimization,and summarizes and categorizes related work under this taxonomy.Finally,it discusses potential future research directions.

Key words: Large language models, Efficient inference, Practical application scenarios, Algorithm optimization, Parameter optimization, System optimization

CLC Number: 

  • TP391
[1]CHANG Y P,WANG X,WANG J D,et al.A survey on evaluation of large language models[J].arXiv:2307.03109,2023.
[2]OpenAI.2023.Introducing ChatGPT[EB/OL].https://openai.com/blog/chatgpt.
[3]Microsoft.Announcing Microsoft Copilot,your everyday AI companion[EB/OL].[2023-12-04].https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/.
[4]TOUVRON H,LAVRIL T,IZACARD G,et al.Llama:Open and efficient foundation language models[J].arXiv:2302.13971,2023.
[5]KACHRIS C.A survey on hardware accelerators for large language models[J].arXiv:2401.09890,2024.
[6]ZHU X,LI J,LIU Y,et al.A survey on model compression for large language models[J].arXiv:2308.07633,2023.
[7]SHAO R R,LIU Y,ZHANG W,et al.A Survey of Knowledge Distillation in Deep Learning[J].Journal of Computer Science,2022,45(8):1638-1673.
[8]HUANG Z H,YANG S Z,LIN W,et al.A Survey of Knowledge Distillation[J].Journal of Computer Science,2022,45(3):624-653.
[9]PARK S,CHOI J,LEE S,et al.A comprehensive survey of compression algorithms for language models[J].arXiv:2401.15347,2024.
[10]KHOSHNOODI M,JAIN V,GAO M,et al.A comprehensive survey of accelerated generation techniques in large language models[J].arXiv:2405.13019,2024.
[11]WANG W,CHEN W,LUO Y,et al.Model compression and efficient inference for large language models:A survey[J].arXiv:2402.09748,2024.
[12]ZHOU Z,NING X,HONG K,et al.A survey on efficient inference for large language models[J].arXiv:2404.14294,2024.
[13]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems(NIPS’17).2017:6000-6010.
[14]BROWN T,MANN B,RYDER N,et al.Language models are few-shot learners[J].Advances in Neural Information Processing Systems,2020,33:1877-1901.
[15]GAO Y,XIONG Y,GAO X,et al.Retrieval-augmented generation for large language models:A survey[J].arXiv:2312.10997,2023.
[16]ZHOU W,JIANG Y E,COTTERELL R,et al.Efficient prompting via dynamic in-context learning[J].arXiv:2305.11170,2023.
[17]YIN F,VIG J,LABAN P,et al.Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning[J].arXiv:2306.01150,2023.
[18]JUNG H,KIM K J.Discrete prompt compression with reinforcement learning[J].IEEE Access,2024,12:72578-72587.
[19]XU F,SHI W,CHOI E.Recomp:Improving retrieval-augmented lms with compression and selective augmentation[J].arXiv:2310.04408,2023.
[20]LISKAVETS B,ROY S,USHAKOV M,et al.Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor[J].arXiv:2502.13374,2025.
[21]WINGATE D,SHOEYBI M,SORENSEN T.Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models[J].arXiv:2210.03162,2022.
[22]MU J,LI X,GOODMAN N.Learning to compress prompts with gist tokens[J].Advances in Neural Information Processing Systems,2023,36:19327-19352.
[23]CHEVALIER A,WETTIG A,AJITH A,et al.Adapting language models to compress contexts[J].arXiv:2305.14788,2023.
[24]GE T,HU J,WANG L,et al.In-context autoencoder for context compression in a large language model[J].arXiv:2307.06945,2023.
[25]WANG H,ZHANG Z,HAN S.Spatten:Efficient sparse attention architecture with cascade token and head pruning[C]//2021 IEEE International Symposium on High-Performance Computer Architecture(HPCA).IEEE,2021:97-110.
[26]KIM S,SHEN S,THORSLEY D,et al.Learned token pruning for transformers[C]//Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.2022:784-794.
[27]WANG Z,CHEN J,ZHOU W,et al.Smarttrim:Adaptive tokens and attention pruning for efficient vision-language models[J].arXiv:2305.15033,2023.
[28]FEDERICI M,BELLI D,VAN BAALEN M,et al.Efficient llm inference using dynamic input pruning and cache-aware masking[J].arXiv:2412.01380,2024.
[29]JIANG Z,XU F F,GAO L,et al.Active retrieval augmented generation[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:7969-7992.
[30]SHI W,MIN S,YASUNAGA M,et al.Replug:Retrieval-augmented black-box language models[J].arXiv:2301.12652,2023.
[31]ASAI A,WU Z,WANG Y,et al.Self-rag:Learning to retrieve,generate,and critique through self-reflection[J].arXiv:2310.11511,2024.
[32]XIN J,TANG R,LEE J,et al.DeeBERT:Dynamic early exiting for accelerating BERT inference[J].arXiv:2004.12993,2020.
[33]SCHWARTZ R,STANOVSKY G,SWAYAMDIPTA S,et al.The right tool for the job:Matching model and instance complexities[J].arXiv:2004.07453,2020.
[34]ZHOU W,XU C,GE T,et al.Bert loses patience:Fast and robust inference with early exit[J].Advances in Neural Information Processing Systems,2020,33:18330-18341.
[35]ZHANG Z,ZHU W,ZHANG J,et al.PCEE-BERT:Accelerating BERT inference via patient and confident early exiting[C]//Findings of the Association for Computational Linguistics:NAACL 2022.2022:327-338.
[36]WANG J,CHEN K,CHEN G,et al.Skipbert:Efficient inference with shallow layer skipping[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2022:7287-7301.
[37]DIN A Y,KARIDI T,CHOSHEN L,et al.Jump to conclusions:Short-cutting transformers with linear transformations[J].arXiv:2303.09435,2023.
[38]SCHUSTER T,FISCH A,GUPTA J,et al.Confident adaptive language modeling[J].Advances in Neural Information Processing Systems,2022,35:17456-17472.
[39]TANG S,WANG Y,KONG Z,et al.You need multiple exiting:Dynamic early exiting for accelerating unified vision language model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:10781-10791.
[40]XU G,HAO J,SHEN L,et al.Lgvit:Dynamic early exiting for accelerating vision transformer[C]//Proceedings of the 31st ACM International Conference on Multimedia.2023:9103-9114.
[41]MEISTER C,VIEIRA T,COTTERELL R.Best-first beam search[J].Transactions of the Association for Computational Linguistics,2020,8:795-809.
[42]FAN A,LEWIS M,DAUPHIN Y.Hierarchical neural story generation[J].arXiv:1805.04833,2018.
[43]HOLTZMAN A,BUYS J,DU L,et al.The curious case of neural text degeneration[J].arXiv:1904.09751,2019.
[44]WANG X,XIONG Y,WEI Y,et al.LightSeq:A high performance inference library for transformers[J].arXiv:2010.13887,2020.
[45]LI L,LIN Y,CHEN D,et al.Cascadebert:Accelerating inference of pre-trained language models via calibrated complete models cascade[J].arXiv:2012.14682,2020.
[46]WANG Y,CHEN K,TAN H,et al.Tabi:An efficient multi-level inference system for large language models[C]//Proceedings of the Eighteenth European Conference on Computer Systems.2023:233-248.
[47]CHEN L,ZAHARIA M,ZOU J.Frugalgpt:How to use large language models while reducing cost and improving performance[J].arXiv:2305.05176,2023.
[48]YUE M,ZHAO J,ZHANG M,et al.Large language model cascades with mixture of thoughts representations for cost-efficient reasoning[J].arXiv:2310.03094,2023.
[49]WEI J,WANG X,SCHUURMANS D,et al.Chain-of-thought prompting elicits reasoning in large language models[J].Advances in Neural Information Processing Systems,2022,35:24824-24837.
[50]CHEN W,MA X,WANG X,et al.Program of thoughts prompting:Disentangling computation from reasoning for numerical reasoning tasks[J].arXiv:2211.12588,2022.
[51]SHAZEER N,MIRHOSEINI A,MAZIARZ K,et al.Outrageously large neural networks:The sparsely-gated mixture-of-experts layer[J].arXiv:1701.06538,2017.
[52]XIA H,YANG Z,DONG Q,et al.Unlocking efficiency in large language model inference:A comprehensive survey of speculative decoding[J].arXiv:2401.07851,2024.
[53]ZHOU Y,LYU K,RAWAT A S,et al.Distillspec:Improving speculative decoding via knowledge distillation[J].arXiv:2310.08461,2023.
[54]ZHANG J,WANG J,LI H,et al.Draft & verify:Lossless large language model acceleration via self-speculative decoding[J].arXiv:2309.08168,2023.
[55]LIU X,HU L,BAILIS P,et al.Online speculative decoding[J].arXiv:2310.07177,2023.
[56]MONEA G,JOULIN A,GRAVE E.Pass:Parallel speculative sampling[J].arXiv:2311.13581,2023.
[57]HE Z,ZHONG Z,CAI T,et al.Rest:Retrieval-based speculative decoding[J].arXiv:2311.08252,2023.
[58]MIAO X,OLIARO G,ZHANG Z,et al.Specinfer:Accelerating generative large language model serving with tree-based speculative inference and verification[J].arXiv:2305.09781,2023.
[59]FU Y,BAILIS P,STOICA I,et al.Break the sequential dependency of llm inference using lookahead decoding[J].arXiv:2402.02057,2024.
[60]CAI T,LI Y,GENG Z,et al.Medusa:Simple llm inference acceleration framework with multiple decoding heads[J].arXiv:2401.10774,2024.
[61]LI Y,ZHANG C,ZHANG H.Eagle:Lossless acceleration of llm decoding by feature extrapolation[EB/OL].[2023-12-08].https://sites.google.com/view/eagle-llm.
[62]SUN Z,SURESH A T,RO J H,et al.Spectr:Fast speculative decoding via optimal transport[J].Advances in Neural Information Processing Systems,2023,36:30222-30242.
[63]LI S,CHEN J,SHEN Y,et al.Explanations from large language models make small reasoners better[J].arXiv:2210.06726,2022.
[64]YANG G,LO D,MULLINS R,et al.Dynamic stashing quantization for efficient transformer training[J].arXiv:2303.05295,2023.
[65]CHENG Y,WANG D,ZHOU P,et al.A survey of model compression and acceleration for deep neural networks[J].arXiv:1710.09282,2017.
[66]FRANTAR E,ASHKBOOS S,HOEFLER T,et al.GPTQ:Accurate post-training quantization for generative pre-trained transformers[C]//The Eleventh International Conference on Learning Representations.2023.
[67]PARK G,PARK B,KIM M,et al.Lut-gemm:Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models[J].arXiv:2206.09557,2022.
[68]LIN J,TANG J,TANG H,et al.AWQ:Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration[J].Proceedings of Machine Learning and Systems,2024,6:87-100.
[69]KIM S,HOOPER C,GHOLAMI A,et al.Squeezellm:Dense-and-sparse quantization[J].arXiv:2306.07629,2023.
[70]YAO Z,WU X,LI C,et al.Zeroquant-v2:Exploring post-training quantization in llms from comprehensive study to low rank compensation[J].arXiv:2303.08302,2023.
[71]DETTMERS T,LEWIS M,BELKADA Y,et al.Llm.int8():8-bit matrix multiplication for transformers at scale[J].Advances in Neural Information Processing Systems,2022,35:30318-30332.
[72]XIAO G,LIN J,SEZNEC M,et al.Smoothquant:Accurate and efficient post-training quantization for large language models[C]//International Conference on Machine Learning.PMLR,2023:38087-38099.
[73]YUAN Z,NIU L,LIU J,et al.Rptq:Reorder-based post-training quantization for large language models[J].arXiv:2304.01089,2023.
[74]YAO Z,YAZDANI AMINABADI R,ZHANG M,et al.Zeroquant:Efficient and affordable post-training quantization for large-scale transformers[J].Advances in Neural Information Processing Systems,2022,35:27168-27183.
[75]LIU Z,OGUZ B,ZHAO C,et al.Llm-qat:Data-free quantization aware training for large language models[C]//Findings of the Association for Computational Linguistics:ACL 2024.2024:467-484.
[76]SAXENA U,SHARIFY S,ROY K,et al.ResQ:Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals[J].arXiv:2412.14363,2024.
[77]ZENG B,JI B,LIU X,et al.LSAQ:Layer-Specific Adaptive Quantization for Large Language Model Deployment[J].arXiv:2412.18135,2024.
[78]LIU S,LIU Z,HUANG X,et al.Llm-fp4:4-bit floating-point quantized transformers[J].arXiv:2310.16836,2023.
[79]KIM J,LEE J H,KIM S,et al.Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization[J].arXiv:2305.14152,2024.
[80]DETTMERS T,PAGNONI A,HOLTZMAN A,et al.Qlora:Efficient finetuning of quantized LLMs[J].arXiv:2305.14314,2024.
[81]HU E J,SHEN Y,WALLIS P,et al.Lora:Low-rank adaptation of large language models[C]//International Conference on Learning Representations.2022.
[82]LI L,LI Q,ZHANG B,et al.Norm tweaking:High-performance low-bit quantization of large language models[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:18536-18544.
[83]LI Y,XU S,ZHANG B,et al.Q-vit:Accurate and fully quantized low-bit vision transformer[J].Advances in Neural Information Processing Systems,2022,35:34451-34463.
[84]FRANTAR E,ALISTARH D.Sparsegpt:Massive language models can be accurately pruned in one-shot[C]//International Conference on Machine Learning.PMLR,2023:10323-10337.
[85]SUN M,LIU Z,BAIR A,et al.A simple and effective pruning approach for large language models[J].arXiv:2306.11695,2023.
[86]ZHANG M,CHEN H,SHEN C,et al.Loraprune:Pruning meets low-rank parameter-efficient fine-tuning[J].arXiv:2305.18403,2023.
[87]CUNEGATTI E,CUSTODE L L,IACCA G.Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training[J].arXiv:2411.07066,2024.
[88]MA X,FANG G,WANG X.Llm-pruner:On the structural pruning of large language models[J].Advances in Neural Information Processing Systems,2023,36:21702-21720.
[89]GORDON A,EBAN E,NACHUM O,et al.Morphnet:Fast & simple resource-constrained structure learning of deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:1586-1595.
[90]YANG T J,HOWARD A,CHEN B,et al.Netadapt:Platform-aware neural network adaptation for mobile applications[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:285-300.
[91]VAN DER OUDERAA T F A,NAGEL M,VAN BAALEN M,et al.The LLM Surgeon[J].arXiv:2312.17244,2023.
[92]LEE J,KIM H.DCT-ViT:High-Frequency Pruned Vision Transformer with Discrete Cosine Transform[J].IEEE Access,2024,12:80386-80396.
[93]YU L,XIANG W.X-pruner:explainable pruning for vision transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:24355-24363.
[94]SANDRI F,CUNEGATTI E,IACCA G.2SSP:A Two-Stage Framework for Structured Pruning of LLMs[J].arXiv:2501.17771,2025.
[95]MUÑOZ J P,YUAN J,JAIN N.Mamba-Shedder:Post-Transformer Compression for Efficient Selective Structured State Space Models[J].arXiv:2501.17088,2025.
[96]MUÑOZ J P,YUAN J,JAIN N.Multipruner:Balanced structure removal in foundation models[J].arXiv:2501.09949,2025.
[97]SORBER L,VAN BAREL M,DE LATHAUWER L.Optimization-based algorithms for tensor decompositions:Canonical polyadic decomposition,decomposition in rank-(L_r,L_r,1) terms,and a new generalization[J].SIAM Journal on Optimization,2013,23(2):695-720.
[98]MØRUP M,HANSEN L K,ARNFRED S M.Algorithms for sparse nonnegative Tucker decompositions[J].Neural Computation,2008,20(8):2112-2131.
[99]SAHA R,SRIVASTAVA V,PILANCI M.Matrix compression via randomized low rank and low precision factorization[J].arXiv:2310.11028,2023.
[100]KAUSHAL A,VAIDHYA T,RISH I.Lord:Low rank decomposition of monolingual code llms for one-shot compression[J].arXiv:2309.14021,2023.
[101]WANG X,ZHENG Y,WAN Z,et al.Svd-llm:Truncation-aware singular value decomposition for large language model compression[J].arXiv:2403.07378,2024.
[102]CHANG C C,SUNG Y Y,YU S,et al.FLORA:Fine-grained Low-Rank Architecture Search for Vision Transformer[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.2024:2482-2491.
[103]BUCILUA C,CARUANA R,NICULESCU-MIZIL A.Model compression[C]//Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2006:535-541.
[104]HINTON G,VINYALS O,DEAN J.Distilling the knowledge in a neural network[J].arXiv:1503.02531,2015.
[105]DONG Q,LI L,DAI D,et al.A survey on in-context learning[J].arXiv:2301.00234,2022.
[106]HUANG Y,CHEN Y,YU Z,et al.In-context learning distillation:Transferring few-shot learning ability of pre-trained language models[J].arXiv:2212.10670,2022.
[107]GOYAL V,KHAN M,TIRUPATI A,et al.Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting[J].arXiv:2412.17846,2024.
[108]ZHOU Z,SHI J X,SONG P X,et al.LawGPT:A Chinese Legal Knowledge-Enhanced Large Language Model[J].arXiv:2406.04614,2024.
[109]CHEN Z,GAO Q,BOSSELUT A,et al.DISCO:Distilling counterfactuals with large language models[J].arXiv:2212.10534,2022.
[110]JIANG Y,CHAN C,CHEN M,et al.Lion:Adversarial distillation of proprietary large language models[J].arXiv:2305.12870,2023.
[111]GU Y,DONG L,WEI F,et al.Knowledge distillation of large language models[J].arXiv:2306.08543,2023.
[112]AGARWAL R,VIEILLARD N,STANCZYK P,et al.Gkd:Generalized knowledge distillation for auto-regressive sequence models[J].arXiv:2306.13649,2023.
[113]LIANG C,ZUO S,ZHANG Q,et al.Less is more:Task-aware layer-wise distillation for language model compression[C]//International Conference on Machine Learning.PMLR,2023:20852-20867.
[114]ZHANG C,YANG Y,LIU J,et al.Lifting the curse of capacity gap in distilling language models[J].arXiv:2305.12129,2023.
[115]CHEN X,CAO Q,ZHONG Y,et al.Dearkd:data-efficient early knowledge distillation for vision transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:12052-12062.
[116]RADEVSKI G,GRUJICIC D,BLASCHKO M,et al.Multimodal distillation for egocentric action recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:5213-5224.
[117]SHENG Y,ZHENG L,YUAN B,et al.Flexgen:High-throughput generative inference of large language models with a single gpu[C]//International Conference on Machine Learning.PMLR,2023:31094-31116.
[118]YU G I,JEONG J S,KIM G W,et al.Orca:A distributed serving system for Transformer-Based generative models[C]//16th USENIX Symposium on Operating Systems Design and Implementation(OSDI 22).2022:521-538.
[119]JIN Y,WU C F,BROOKS D,et al.S3:Increasing GPU Utilization during Generative Inference for Higher Throughput[J].Advances in Neural Information Processing Systems,2023,36:18015-18027.
[120]KWON W,LI Z,ZHUANG S,et al.Efficient memory management for large language model serving with pagedattention[C]//Proceedings of the 29th Symposium on Operating Systems Principles.2023:611-626.
[121]LIU J,CHUNG J W,WU Z,et al.Andes:Defining and enhancing quality-of-experience in llm-based text streaming services[J].arXiv:2404.16283,2024.
[122]AMINABADI R Y,RAJBHANDARI S,AWAN A A,et al.Deepspeed-inference:enabling efficient inference of transformer models at unprecedented scale[C]//SC22:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE,2022:1-15.
[123]DAO T,HAZIZA D,MASSA F,et al.Flash-decoding for long-context inference[EB/OL].[2023-10-13].https://pytorch.org/blog/flash-decoding/.
[124]HONG K,DAI G,XU J,et al.Flashdecoding++:Faster large language model inference on gpus[J].arXiv:2311.01282,2023.
[125]GONG R,BAI S,WU S,et al.Past-future scheduler for llm serving under sla guarantees[C]//Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.2025:798-813.
[126]QIN Z,CAO Y,LIN M,et al.CAKE:Cascading and adaptiveKV cache eviction with layer preferences[J].arXiv:2503.12491,2025.
[127]YE Z,CHEN L,LAI R,et al.Flashinfer:Efficient and customizable attention engine for llm inference serving[J].arXiv:2501.01005,2025.
[128]WU J,WANG Z,ZHANG L,et al.SCOPE:Optimizing Key-Value Cache Compression in Long-context Generation[J].arXiv:2412.13649,2024.
[129]WAN Z,SHEN H,WANG X,et al.Meda:Dynamic kv cache allocation for efficient multimodal long-context inference[J].arXiv:2502.17599,2025.
[130]TRAN B,LI J,MADRY A.Spectral signatures in backdoor attacks[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems.2018.
[131]CHOWDHERY A,NARANG S,DEVLIN J,et al.Palm:Scaling language modeling with pathways[J].Journal of Machine Learning Research,2023,24(240):1-113.
[132]HUANG Y,CHENG Y,BAPNA A,et al.Gpipe:Efficient training of giant neural networks using pipeline parallelism[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems.2019:103-112.
[133]LI S,XUE F,BARANWAL C,et al.Sequence parallelism:Long sequence training from system perspective[J].arXiv:2105.13120,2021.
[134]ZHENG L,LI Z,ZHANG H,et al.Alpa:Automating inter- and intra-operator parallelism for distributed deep learning[C]//16th USENIX Symposium on Operating Systems Design and Implementation(OSDI 22).2022:559-578.
[135]JIA Z,ZAHARIA M,AIKEN A.Beyond data and model parallelism for deep neural networks[J].Proceedings of Machine Learning and Systems,2019,1:1-13.
[136]MIAO X,WANG Y,JIANG Y,et al.Galvatron:Efficient transformer training over multiple gpus using automatic parallelism[J].arXiv:2211.13878,2022.
[137]LI Z,ZHENG L,ZHONG Y,et al.AlpaServe:Statistical multiplexing with model parallelism for deep learning serving[C]//17th USENIX Symposium on Operating Systems Design and Implementation(OSDI 23).2023:663-679.
[138]LU W,YAN G,LI J,et al.Flexflow:A flexible dataflow accelerator architecture for convolutional neural networks[C]//2017 IEEE International Symposium on High Performance Computer Architecture(HPCA).IEEE,2017:553-564.
[139]MIAO X,SHI C,DUAN J,et al.Spotserve:Serving generative large language models on preemptible instances[C]//Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems,Volume 2.2024:1112-1127.
[140]BORZUNOV A,BARANCHUK D,DETTMERS T,et al.Petals:Collaborative inference and fine-tuning of large models[J].arXiv:2209.01188,2022.
[141]WANG Y,XUE D,ZHANG S,et al.Badagent:Inserting and activating backdoor attacks in llm agents[J].arXiv:2406.03007,2024.