Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240900125-5. doi: 10.11896/jsjkx.240900125

• Image Processing & Multimedia Technology •

Distillation Method for Text-to-Audio Generation Based on Balanced SNR-aware

LIU Bingzhi1, CAO Yin2, ZHOU Yi1   

  1. School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Department of Intelligent Science, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: ZHOU Yi (zhouy@cqupt.edu.cn)
  • About author: LIU Bingzhi, born in 1998, postgraduate (s220131050@stu.cqupt.edu.cn). His main research interests include speech synthesis and audio codecs.
    ZHOU Yi, born in 1974, Ph.D., professor, Ph.D. supervisor. His main research interests include speech enhancement and machine hearing technology.
  • Supported by:
    National Key R&D Program of China (2024QY2630).

Abstract: Diffusion models have demonstrated promising results in text-to-audio (TTA) generation tasks. However, their slow sampling speeds restrict their applicability in high-throughput scenarios. To address this issue, progressive distillation methods have been applied to create more streamlined and efficient models. Nevertheless, these methods suffer from unbalanced loss weights at high and low noise levels, which impairs training effectiveness and the quality of generated samples. In this paper, we propose a balanced SNR-aware (BSA) method, an enhanced loss-weighting mechanism for diffusion distillation that weights the loss in a balanced manner across both high and low noise levels. We evaluate the proposed method on the AudioCaps dataset. The experimental results show superior performance over previous distillation methods at the same number of sampling steps. Furthermore, the BSA method reduces the number of sampling steps from 200 to 25 with minimal performance degradation compared to the original teacher model.
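The exact BSA weighting rule is not reproduced on this page, so the following PyTorch sketch only illustrates the idea described above: deriving the loss weight from the per-timestep signal-to-noise ratio and truncating it at both extremes, in contrast to one-sided schemes such as min-SNR weighting, which clips only at high SNR. The names bsa_weight, gamma_low, and gamma_high and their default thresholds are hypothetical, not the paper's published formulation.

import torch

def bsa_weight(snr: torch.Tensor,
               gamma_low: float = 0.5,
               gamma_high: float = 5.0) -> torch.Tensor:
    """Hypothetical balanced SNR-aware weight: clamp the SNR from both
    sides so that neither very noisy (low-SNR) nor nearly clean
    (high-SNR) timesteps dominate the distillation loss."""
    return snr.clamp(min=gamma_low, max=gamma_high) / snr

def distillation_loss(student_pred: torch.Tensor,
                      teacher_target: torch.Tensor,
                      snr: torch.Tensor) -> torch.Tensor:
    """Weighted MSE between the student's prediction and the teacher's
    target, averaged over latent dimensions and then over the batch."""
    per_sample = (student_pred - teacher_target).pow(2).flatten(1).mean(dim=1)
    return (bsa_weight(snr) * per_sample).mean()

For scale, the reported reduction from 200 to 25 sampling steps is consistent with three rounds of progressive distillation, each halving the student's step count (200 → 100 → 50 → 25).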

Key words: Text-to-audio, Progressive distillation, Latent diffusion, Accelerated sampling

CLC Number: TP183