Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240900125-5. DOI: 10.11896/jsjkx.240900125
LIU Bingzhi1, CAO Yin2, ZHOU Yi1
Abstract: Diffusion models achieve strong performance on text-to-audio (TTA) generation, but their slow sampling limits their use in high-throughput scenarios. To improve efficiency, progressive distillation has been used to build leaner models. However, that method allocates loss weights unevenly between high- and low-noise levels, which degrades both training effectiveness and generation quality. To address this, a Balanced SNR-Aware (BSA) method is proposed: an improved loss weighting mechanism tailored to diffusion distillation that balances the losses at high and low noise levels. Experiments on the public AudioCaps dataset show that BSA outperforms previous distillation methods at the same number of sampling steps, and reduces the number of sampling steps from 200 to 25 with generation quality nearly indistinguishable from that of the original teacher model.
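The abstract does not give BSA's exact formula, so the sketch below only illustrates the general idea of SNR-aware loss weighting in diffusion training, using the published min-SNR-gamma strategy (Hang et al., 2023) that weighting schemes of this kind refine. All names here (alpha_t, sigma_t, student_eps, teacher_eps) and the choice gamma = 5 are illustrative assumptions, not details taken from the paper.

    import torch

    def snr(alpha_t: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
        # Signal-to-noise ratio of the diffusion forward process at step t,
        # where x_t = alpha_t * x_0 + sigma_t * noise.
        return (alpha_t ** 2) / (sigma_t ** 2)

    def min_snr_weight(alpha_t: torch.Tensor, sigma_t: torch.Tensor,
                       gamma: float = 5.0) -> torch.Tensor:
        # Min-SNR-gamma weighting for epsilon-prediction losses:
        # clamping the SNR at gamma down-weights low-noise (high-SNR)
        # timesteps so they do not dominate training.
        s = snr(alpha_t, sigma_t)
        return torch.clamp(s, max=gamma) / s

    # Hypothetical use inside one distillation step (placeholder tensors):
    # w = min_snr_weight(alpha_t, sigma_t)                       # (batch,)
    # per_sample = (student_eps - teacher_eps).pow(2).flatten(1).mean(dim=1)
    # loss = (w * per_sample).mean()

For epsilon-prediction, this effective weight min(SNR, gamma)/SNR stays near 1 at high noise and shrinks like gamma/SNR at low noise, which is one concrete way the high-/low-noise imbalance described in the abstract can be rebalanced; BSA's own weighting may differ in detail.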