Computer Science ›› 2026, Vol. 53 ›› Issue (6): 102-116. doi: 10.11896/jsjkx.251000119

• High Performance Computing •

Review on Parallel Training and Inference of Diffusion Models

ZHU Huming, LIU Huijie, DONG Ximiao, CHEN Zhipeng, GAO Tianqi, JIAO Licheng   

  1. School of Artificial Intelligence, Xidian University, Xi'an 710071, China
    Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xi'an 710071, China
  • Received:2025-10-27 Revised:2026-01-08 Online:2026-06-15 Published:2026-06-09
  • About author: ZHU Huming, born in 1978, Ph.D., associate professor, doctoral supervisor. His main research interests include high-performance computing and artificial intelligence computing systems.
  • Supported by:
    Key Research and Development Program of Shaanxi(2022ZDLGY01-09).

Abstract: Diffusion models (DMs) have demonstrated outstanding generation quality and controllability in image and video generation tasks. However, they still face significant system-level performance bottlenecks in large-scale training and inference scenarios. Starting from the fundamental principles and modeling paradigms of diffusion models, and considering the computational characteristics of denoising networks, this paper systematically analyzes the challenges encountered during both training and inference, and summarizes the parallel optimization strategies and distributed training frameworks adopted by mainstream open-source diffusion models. The analysis shows that the training stage is mainly constrained by high memory consumption, long training time, and insufficient computational efficiency, while the inference stage suffers from redundant computation along the time-step dimension and high inference latency. To address these bottlenecks, this paper reviews and compares various optimization methods, including data parallelism, tensor parallelism, sequence parallelism, pipeline parallelism, and time-step parallelism, and systematically analyzes their applicability and potential efficiency gains in terms of memory optimization, communication cost reduction, and computation-communication overlap. Based on open-source technical reports and experimental results, the study shows that parallel optimization can significantly reduce memory overhead and improve inference speed. Furthermore, the parallel support characteristics of mainstream diffusion model inference frameworks are investigated, revealing potential future directions in multi-node inference, dynamic scheduling, and mixture-of-experts parallelism. This study provides a systematic reference for efficient training and inference of diffusion models and can inform performance optimization and distributed deployment of large-scale generative models.

Key words: Diffusion models, Parallel training, Distributed computing, Inference acceleration, Timestep parallelism

CLC Number: TP391