Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700138-6. doi: 10.11896/jsjkx.240700138

• Large Language Model Technology and Applications •

Low-resource Vietnamese Speech Synthesis Based on Phoneme Large Language Model and Diffusion Model

ZOU Rui, YANG Jian, ZHANG Kai

  1. School of Information Science and Engineering, Yunnan University, Kunming 650504, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: YANG Jian (jianyang@ynu.edu.cn)
  • About author: (june_ouui@163.com)
  • Supported by:
    National Key Research and Development Program of China (2020AAA0107901)

Low-resource Vietnamese Speech Synthesis Based on Phoneme Large Language Model and Diffusion Model

ZOU Rui, YANG Jian, ZHANG Kai   

  1. School of Information Science & Engineering,Yunnan University,Kunming 650504,China
  • Online:2025-06-16 Published:2025-06-12
  • About author:ZOU Rui,born in 2000,postgraduate.Her main research interests include speech synthesis,recognition and understanding.
    YANG Jian,born in 1964,Ph.D,professor.His main research interests include speech synthesis,recognition and understanding.
  • Supported by:
    National Key Research and Development Program(2020AAA0107901).

Abstract: With the development of deep learning technology and the deepening of speech synthesis research, synthetic speech in widely used, high-resource languages such as Chinese and English has come increasingly close to natural speech. Vietnamese, a tonal language closely related to Chinese, belongs to the Vietic branch of the Austroasiatic language family. Constrained by the scale of obtainable corpus data and the depth of related research, Vietnamese speech synthesis still falls clearly short of natural speech. Under a low-resource premise, two methods are proposed to improve the naturalness of Vietnamese speech synthesis: 1) a phoneme encoder is built on the pre-trained phoneme large language model XPhoneBERT, which significantly improves the prosodic expressiveness of Vietnamese speech synthesis when the dataset is limited; 2) the U-Net structure in the lightweight diffusion speech synthesis model LightGrad is improved by adding nested skip paths, so that the model can be trained sufficiently under low-resource conditions, capture more useful information, and predict noise more accurately, thereby improving synthesis quality. Experimental results show that with the proposed methods, the objective and subjective evaluation of the Vietnamese speech synthesis system improves markedly: MCD (mel-cepstral distortion) and MOS (mean opinion score) reach 6.25 and 4.22 respectively, a clear decrease and increase compared with the baseline system's 7.44 and 3.56.
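The diffusion half of the system rests on the standard denoising objective referenced in the abstract: the network is trained to predict the noise that was injected into a mel-spectrogram. A minimal sketch of one such training step follows; the linear beta schedule, the toy convolutional "predictor", and all sizes are illustrative assumptions, not the paper's LightGrad configuration.

```python
import torch

# Linear noise schedule (illustrative): beta_t grows from 1e-4 to 0.02.
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)  # cumulative product alpha-bar_t

def diffuse(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alphas_cum[t].view(-1, 1, 1)  # broadcast over (channels, frames)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

mel = torch.randn(4, 80, 64)            # stand-in clean mel batch (B, bins, frames)
t = torch.randint(0, T, (4,))           # a random diffusion step per example
eps = torch.randn_like(mel)             # the noise the model must recover
x_t = diffuse(mel, t, eps)

# Stand-in noise predictor; the paper uses a U-Net here.
model = torch.nn.Conv1d(80, 80, 3, padding=1)
loss = torch.nn.functional.mse_loss(model(x_t), eps)
```

In training, `loss.backward()` plus an optimizer step would follow; sampling runs the learned predictor in reverse from pure noise.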

Key words: Speech synthesis, Vietnamese, Low resource, Large language model, Diffusion model

Abstract: With the development of deep learning technology and the progress of speech synthesis research, synthetic speech in widely spoken, high-resource languages such as Chinese and English has come increasingly close to natural speech. Vietnamese, a tonal language closely related to Chinese, belongs to the Vietic branch of the Austroasiatic language family. Due to the scale of available corpus data and the depth of related research, Vietnamese speech synthesis still falls significantly short of natural speech. Under a low-resource premise, two methods are proposed to improve the naturalness of Vietnamese speech synthesis: 1) a phoneme encoder is constructed on top of the pre-trained phoneme large language model XPhoneBERT, which significantly improves the prosodic expressiveness of Vietnamese speech synthesis given a limited dataset; 2) the U-Net structure in the lightweight diffusion TTS model LightGrad is improved by adding nested skip paths, so that the model can be trained adequately under low-resource conditions, capture more useful information, and predict noise more accurately, thereby improving the quality of the synthesized speech. Experimental results show that the proposed methods clearly improve the objective and subjective evaluation of the Vietnamese speech synthesis system: MCD and MOS reach 6.25 and 4.22 respectively, a significant decrease and increase compared with the baseline system's 7.44 and 3.56.
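The nested skip paths described above follow the UNet++ idea of inserting intermediate fusion nodes on the encoder-decoder skip connections, so shallow features are refined by deeper ones before the decoder consumes them. Below is a minimal one-level sketch on a 1-D U-Net of the kind used for mel-spectrogram noise prediction; the depth, channel widths, and layer choices are illustrative assumptions, not LightGrad's actual architecture.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # Basic conv block; Mish is the activation LightGrad-style models favor.
    return nn.Sequential(nn.Conv1d(cin, cout, 3, padding=1), nn.Mish())

class NestedSkipUNet1D(nn.Module):
    def __init__(self, ch=32, mel=80):
        super().__init__()
        self.x00 = block(mel, ch)        # encoder node X(0,0)
        self.down = nn.MaxPool1d(2)
        self.x10 = block(ch, 2 * ch)     # encoder node X(1,0), half resolution
        self.up = nn.Upsample(scale_factor=2)
        # Nested node X(0,1) sits on the skip path: it fuses the plain skip
        # (x00) with the upsampled deeper feature instead of passing x00
        # through unchanged, which is the UNet++ modification.
        self.x01 = block(ch + 2 * ch, ch)
        self.out = nn.Conv1d(ch, mel, 1)  # noise-prediction head

    def forward(self, x):
        x00 = self.x00(x)                               # (B, ch, T)
        x10 = self.x10(self.down(x00))                  # (B, 2ch, T/2)
        x01 = self.x01(torch.cat([x00, self.up(x10)], dim=1))
        return self.out(x01)                            # (B, mel, T)

noisy_mel = torch.randn(2, 80, 64)    # (batch, mel bins, frames)
pred_noise = NestedSkipUNet1D()(noisy_mel)
print(pred_noise.shape)               # torch.Size([2, 80, 64])
```

A deeper network repeats the pattern, adding nodes X(0,2), X(1,1), and so on, each fusing all shallower outputs at its level.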

Key words: Speech synthesis, Vietnamese, Low resources, Large language model, Diffusion model

CLC number:

  • TP391
[1]WANG Y,SKERRY-RYAN R J,STANTON D,et al.Tacotron:Towards End-to-End Speech Synthesis[C]//Interspeech.2017.
[2]SHEN J,PANG R,WEISS R J,et al.Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2018).IEEE,2018.
[3]PING W,PENG K W,GIBIANSKY A,et al.Deep voice 3:Scaling text-to-speech with convolutional sequence learning[J].arXiv:1710.07654,2017.
[4]REN Y,RUAN Y,TAN X,et al.FastSpeech:Fast,robust and controllable text to speech[C]//Advances in Neural Information Processing Systems.2019.
[5]REN Y,HU C,TAN X,et al.FastSpeech 2:Fast and high-quality end-to-end text to speech[J].arXiv:2006.04558,2020.
[6]PING W,PENG K,CHEN J.ClariNet:Parallel wave generation in end-to-end text-to-speech[J].arXiv:1807.07281,2018.
[7]DONAHUE J,DIELEMAN S,BIŃKOWSKI M,et al.End-to-end adversarial text-to-speech[J].arXiv:2006.03575,2020.
[8]HO J,JAIN A,ABBEEL P.Denoising diffusion probabilistic models[J].Advances in Neural Information Processing Systems,2020,33:6840-6851.
[9]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[J].Advances in Neural Information Processing Systems,2014,27.
[10]KIM J,KIM S,KONG J,et al.Glow-TTS:A generative flow for text-to-speech via monotonic alignment search[J].Advances in Neural Information Processing Systems,2020,33:8067-8077.
[11]POPOV V,VOVK I,GOGORYAN V,et al.Grad-TTS:A Diffusion Probabilistic Model for Text-to-Speech[C]//International Conference on Machine Learning.2021.
[12]CHEN J,SONG X,WU Z,et al.LightGrad:Lightweight Diffusion Probabilistic Model for Text-to-Speech[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2023).IEEE,2023:1-5.
[13]LU C,ZHOU Y,BAO F,et al.DPM-Solver:A fast ode solver for diffusion probabilistic model sampling in around 10 steps[J].Advances in Neural Information Processing Systems,2022,35:5775-5787.
[14]LIANG Z,SHI H,WANG J,et al.EM-TTS:Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech[J].arXiv:2403.08164,2024.
[15]JEONG M,KIM M,CHOI B J,et al.Transfer Learning for Low-Resource,Multi-Lingual,and Zero-Shot Multi-Speaker Text-to-Speech[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2024.
[16]LAM T Q,et al.Instance-based transfer learning approach for Vietnamese speech synthesis with very low resource[C]//Future of Information and Communication Conference.Cham:Springer International Publishing,2022.
[17]PHUN V L.Data processing for optimizing naturalness of Vietnamese text-to-speech system[C]//2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques(O-COCOSDA).IEEE,2020.
[18]NGUYEN L T,PHAM T,NGUYEN D Q.XPhoneBERT:A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech[J].arXiv:2305.19709,2023.
[19]ZHOU Z,SIDDIQUEE M M R,TAJBAKHSH N,et al.UNet++:Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation[J].IEEE Transactions on Medical Imaging,2020,39(6):1856-1867.
[20]SONG Y,SOHL-DICKSTEIN J,KINGMA D P,et al.Score-based generative modeling through stochastic differential equations[J].arXiv:2011.13456,2020.
[21]KONG J,KIM J,BAE J.HiFi-GAN:Generative adversarial networks for efficient and high fidelity speech synthesis[J].Advances in Neural Information Processing Systems,2020,33:17022-17033.
[22]CHOLLET F.Xception:Deep Learning with Depthwise Separable Convolutions[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Honolulu,HI,USA,2017:1800-1807.
[23]ELLINAS N,VAMVOUKAKIS G,MARKOPOULOS K,et al.High quality streaming speech synthesis with low,sentence-length-independent latency[J].arXiv:2111.09052,2021.
[24]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//North American Chapter of the Association for Computational Linguistics.2019.
[25]LIU Y,OTT M,GOYAL N,et al.RoBERTa:A robustly optimized bert pretraining approach[J].arXiv:1907.11692,2019.
[26]MISRA D.Mish:A Self Regularized Non-Monotonic Activation Function[C]//British Machine Vision Conference.2020.
[27]SHEN Z,ZHANG M,ZHAO H,et al.Efficient Attention:Attention with Linear Complexities[C]//2021 IEEE Winter Conference on Applications of Computer Vision(WACV).Waikoloa,HI,USA,2021:3530-3538.