Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240900125-5. doi: 10.11896/jsjkx.240900125

• Image Processing & Multimedia Technology •

Distillation Method for Text-to-Audio Generation Based on Balanced SNR-aware

LIU Bingzhi1, CAO Yin2, ZHOU Yi1   

  1. School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Department of Intelligent Science, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
  • Online: 2025-06-16  Published: 2025-06-12
  • About author: LIU Bingzhi, born in 1998, postgraduate. His main research interests include speech synthesis and audio codecs.
    ZHOU Yi, born in 1974, Ph.D, professor, Ph.D supervisor. His main research interests include speech enhancement and machine hearing technology.
  • Supported by:
    National Key R&D Program of China (2024QY2630).

Abstract: Diffusion models have demonstrated promising results in text-to-audio (TTA) generation tasks. However, their practical usability is limited by slow sampling speeds, which restricts their applicability in high-throughput scenarios. To address this issue, progressive distillation methods have been applied to produce more streamlined and efficient models. Nevertheless, these methods suffer from unbalanced loss weights at high and low noise levels, which can degrade the quality of generated samples. In this paper, we propose balanced SNR-aware (BSA), an enhanced loss-weighting mechanism for diffusion distillation that weights the loss in a balanced way across both high and low noise levels. We evaluate the proposed method on the AudioCaps dataset; the experimental results show superior performance during the reverse diffusion process compared to previous distillation methods with the same number of sampling steps. Furthermore, the BSA method allows a significant reduction in sampling steps from 200 to 25 with minimal performance degradation compared to the original teacher models.
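The balanced loss weighting described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual method: it assumes the per-step weight is obtained by clipping the signal-to-noise ratio from both sides, and the bounds `low` and `high` are hypothetical placeholder values.

```python
import math

def snr(alpha_t: float, sigma_t: float) -> float:
    """SNR of a diffusion state z_t = alpha_t * x + sigma_t * eps."""
    return (alpha_t ** 2) / (sigma_t ** 2)

def balanced_snr_weight(alpha_t: float, sigma_t: float,
                        low: float = 0.1, high: float = 5.0) -> float:
    """Clip the SNR from both sides so that neither low-noise (high-SNR)
    nor high-noise (low-SNR) timesteps dominate the distillation loss.
    `low` and `high` are illustrative bounds, not values from the paper."""
    return max(low, min(snr(alpha_t, sigma_t), high))

def weighted_distill_loss(student_x0: float, teacher_x0: float,
                          alpha_t: float, sigma_t: float) -> float:
    """Per-step distillation loss: SNR-balanced weight times the squared
    error between the student's and teacher's sample predictions."""
    w = balanced_snr_weight(alpha_t, sigma_t)
    return w * (student_x0 - teacher_x0) ** 2

# Example on a variance-preserving schedule (alpha_t^2 + sigma_t^2 = 1):
for alpha in (0.99, 0.7, 0.1):  # low, medium, high noise
    sigma = math.sqrt(1.0 - alpha ** 2)
    print(f"alpha={alpha:.2f}  weight={balanced_snr_weight(alpha, sigma):.3f}")
```

The two-sided clip contrasts with min-SNR weighting, which caps the weight only at the high-SNR (low-noise) end; a balanced variant also bounds it from below so that high-noise steps retain a non-negligible contribution during distillation.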

Key words: Text-to-Audio, Progressive distillation, Latent diffusion, Accelerated sampling

CLC Number: TP183