Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700138-6. doi: 10.11896/jsjkx.240700138

• Large Language Model Technology and Its Application •

Low-resource Vietnamese Speech Synthesis Based on Phoneme Large Language Model and Diffusion Model

ZOU Rui, YANG Jian, ZHANG Kai   

  1. School of Information Science & Engineering, Yunnan University, Kunming 650504, China
  • Online: 2025-06-16 Published: 2025-06-12
  • About author: ZOU Rui, born in 2000, postgraduate. Her main research interests include speech synthesis, recognition and understanding.
    YANG Jian, born in 1964, Ph.D., professor. His main research interests include speech synthesis, recognition and understanding.
  • Supported by:
    National Key Research and Development Program (2020AAA0107901).

Abstract: With the development of deep learning and continued progress in speech synthesis research, synthetic speech in widely spoken, high-resource languages such as Chinese and English has come increasingly close to natural speech. Vietnamese, a tonal language with close historical ties to Chinese, belongs to the Vietic branch of the Austroasiatic language family. Owing to the limited scale of available corpus data and the limited depth of related research, Vietnamese speech synthesis still falls significantly short of natural speech. Under this low-resource premise, two methods are proposed to improve the naturalness of synthesized Vietnamese speech: 1) A phoneme encoder is constructed on top of the pre-trained phoneme large language model XPhoneBERT, which significantly improves the prosodic expressiveness of Vietnamese speech synthesis on a limited dataset. 2) The U-Net structure in the lightweight diffusion TTS model LightGrad is improved by adding nested skip pathways, so that the model can be trained adequately under low-resource conditions, capture more useful information, and predict noise more accurately, thereby improving the quality of the synthesized speech. Experimental results show that both the objective and subjective evaluations of the Vietnamese speech synthesis system improve significantly with the proposed methods: MCD decreases from 7.44 for the baseline system to 6.25, and MOS increases from 3.56 to 4.22.
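Since this page carries only the abstract, the following minimal PyTorch sketch illustrates the two proposed ideas rather than the authors' implementation. The checkpoint name vinai/xphonebert-base is the publicly released XPhoneBERT model; the phoneme string, the class name NestedSkipUNet, its channel width, and its two-level depth are hypothetical illustrative choices, and LightGrad's diffusion time-step and text conditioning are omitted for brevity.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    # (1) Phoneme encoder: XPhoneBERT's hidden states serve as phoneme-level
    # linguistic features. XPhoneBERT expects a space-separated phoneme string
    # produced by an external grapheme-to-phoneme front end.
    tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")
    xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
    phonemes = "s i n ˥ tɕ a w ˨˩"  # hypothetical phonemization of "xin chào"
    inputs = tokenizer(phonemes, return_tensors="pt")
    with torch.no_grad():
        phoneme_features = xphonebert(**inputs).last_hidden_state  # (1, T, 768)

    # (2) A two-level noise predictor with one UNet++-style nested skip path:
    # the output node sees the raw shallow feature x00, the upsampled deeper
    # feature x10, and an intermediate node x01 that already fuses the two.
    class NestedSkipUNet(nn.Module):
        def __init__(self, ch: int = 64):
            super().__init__()
            def block(cin):
                return nn.Sequential(nn.Conv2d(cin, ch, 3, padding=1), nn.Mish())
            self.enc0, self.enc1 = block(1), block(ch)
            self.down = nn.MaxPool2d(2)
            self.up = nn.Upsample(scale_factor=2, mode="nearest")
            self.node01 = block(2 * ch)  # nested node X(0,1): cat[x00, up(x10)]
            self.dec = block(3 * ch)     # output node: cat[x00, x01, up(x10)]
            self.out = nn.Conv2d(ch, 1, 1)

        def forward(self, noisy_mel: torch.Tensor) -> torch.Tensor:
            x00 = self.enc0(noisy_mel)          # depth-0 feature
            x10 = self.enc1(self.down(x00))     # depth-1 feature
            x01 = self.node01(torch.cat([x00, self.up(x10)], dim=1))  # nested skip
            y = self.dec(torch.cat([x00, x01, self.up(x10)], dim=1))
            return self.out(y)                  # predicted noise

    noise = NestedSkipUNet()(torch.randn(1, 1, 80, 128))  # (batch, 1, mels, frames)

The extra node X(0,1) gives the output layer access to features already refined by the deeper path, which is the UNet++ idea of reference [19]; in the proposed system a nested-skip predictor of this kind replaces the plain U-Net inside LightGrad's noise estimator.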

Key words: Speech synthesis, Vietnamese, Low resources, Large language model, Diffusion model

CLC Number: TP391

[1]WANG Y,SKERRY-RYAN R J,STANTON D,et al.Tacotron:Towards End-to-End Speech Synthesis[C]//Interspeech.2017.
[2]SHEN J,PANG R,WEISS R J,et al.Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2018).IEEE,2018.
[3]PING W,PENG K,GIBIANSKY A,et al.Deep Voice 3:Scaling text-to-speech with convolutional sequence learning[J].arXiv:1710.07654,2017.
[4]REN Y,RUAN Y,TAN X,et al.FastSpeech:Fast,robust and controllable text to speech[C]//Advances in Neural Information Processing Systems.2019.
[5]REN Y,HU C,TAN X,et al.FastSpeech 2:Fast and high-quality end-to-end text to speech[J].arXiv:2006.04558,2020.
[6]PING W,PENG K,CHEN J.ClariNet:Parallel wave generation in end-to-end text-to-speech[J].arXiv:1807.07281,2018.
[7]DONAHUE J,DIELEMAN S,BIŃKOWSKI M,et al.End-to-end adversarial text-to-speech[J].arXiv:2006.03575,2020.
[8]HO J,JAIN A,ABBEEL P.Denoising diffusion probabilistic models[J].Advances in Neural Information Processing Systems,2020,33:6840-6851.
[9]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[J].Advances in Neural Information Processing Systems,2014,27.
[10]KIM J,KIM S,KONG J,et al.Glow-TTS:A generative flow for text-to-speech via monotonic alignment search[J].Advances in Neural Information Processing Systems,2020,33:8067-8077.
[11]POPOV V,VOVK I,GOGORYAN V,et al.Grad-TTS:A Diffusion Probabilistic Model for Text-to-Speech[C]//International Conference on Machine Learning.2021.
[12]CHEN J.LightGrad:Lightweight Diffusion Probabilistic Model for Text-to-Speech[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2023).2023:1-5.
[13]LU C,ZHOU Y,BAO F,et al.DPM-Solver:A fast ODE solver for diffusion probabilistic model sampling in around 10 steps[J].Advances in Neural Information Processing Systems,2022,35:5775-5787.
[14]LIANG Z,SHI H,WANG J,et al.EM-TTS:Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech[J].arXiv:2403.08164,2024.
[15]JEONG M,KIM M,CHOI B J,et al.Transfer Learning for Low-Resource,Multi-Lingual,and Zero-Shot Multi-Speaker Text-to-Speech[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2024.
[16]LAM T Q,et al.Instance-based transfer learning approach for Vietnamese speech synthesis with very low resource[C]//Future of Information and Communication Conference.Cham:Springer International Publishing,2022.
[17]PHUNG V L.Data processing for optimizing naturalness of Vietnamese text-to-speech system[C]//2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques(O-COCOSDA).IEEE,2020.
[18]NGUYEN L T,PHAM T,NGUYEN D Q.XPhoneBERT:A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech[J].arXiv:2305.19709,2023.
[19]ZHOU Z,SIDDIQUEE M M R,TAJBAKHSH N,et al.UNet++:Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation[J].IEEE Transactions on Medical Imaging,2020,39(6):1856-1867.
[20]SONG Y,SOHL-DICKSTEIN J,KINGMA D P,et al.Score-based generative modeling through stochastic differential equations[J].arXiv:2011.13456,2020.
[21]KONG J,KIM J,BAE J.HiFi-GAN:Generative adversarial networks for efficient and high fidelity speech synthesis[J].Advances in Neural Information Processing Systems,2020,33:17022-17033.
[22]CHOLLET F.Xception:Deep Learning with Depthwise Separable Convolutions[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Honolulu,HI,USA,2017:1800-1807.
[23]ELLINAS N,VAMVOUKAKIS G,MARKOPOULOS K,et al.High quality streaming speech synthesis with low,sentence-length-independent latency[J].arXiv:2111.09052,2021.
[24]DEVLIN J.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//North American Chapter of the Association for Computational Linguistics.2019.
[25]LIU Y,OTT M,GOYAL N,et al.RoBERTa:A robustly optimized BERT pretraining approach[J].arXiv:1907.11692,2019.
[26]MISRA D.Mish:A Self Regularized Non-Monotonic Activation Function[C]//British Machine Vision Conference.2020.
[27]SHEN Z,ZHANG M,ZHAO H,et al.Efficient Attention:Attention with Linear Complexities[C]//2021 IEEE Winter Conference on Applications of Computer Vision(WACV).Waikoloa,HI,USA,2021:3530-3538.