Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700138-6. doi: 10.11896/jsjkx.240700138
ZOU Rui, YANG Jian, ZHANG Kai
Abstract: With advances in deep learning and speech synthesis research, synthesized speech for widely spoken, high-resource languages such as Chinese and English has become increasingly close to natural speech. Vietnamese, a tonal language closely connected with Chinese, belongs to the Viet branch of the Vietic (Viet-Muong) group of the Austroasiatic language family. Constrained by the scale of available corpora and the depth of related research, Vietnamese speech synthesis still falls noticeably short of natural speech. Under this low-resource setting, two methods are proposed to improve the naturalness of Vietnamese speech synthesis: 1) building the phoneme encoder on XPhoneBERT, a pretrained phoneme-level language model, which markedly improves the prosodic expressiveness of Vietnamese speech synthesis when training data are limited; 2) improving the U-Net structure in the lightweight diffusion-based synthesis model LightGrad by adding nested skip pathways, allowing the model to be trained adequately under low-resource conditions, capture more useful information, and predict noise more accurately, thereby improving synthesis quality. Experimental results show that with the proposed methods, the Vietnamese speech synthesis system improves clearly on both objective and subjective evaluations: MCD (mel-cepstral distortion) reaches 6.25 and MOS (mean opinion score) reaches 4.22, compared with 7.44 and 3.56 for the baseline system, a clear reduction and improvement respectively.
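The second method adds nested skip pathways to the U-Net noise predictor, following the UNet++ idea of dense skip connections. Below is a minimal NumPy sketch of that topology only, not the paper's model: `h` is a hypothetical stand-in for a convolution block (it simply averages its inputs), and the 1-D feature shapes are purely illustrative. Node X[i][j] fuses all earlier nodes at the same resolution with the upsampled node from the level below.

```python
import numpy as np

def down(x):
    return x[::2]            # 2x downsampling (stride-2 stand-in)

def up(x):
    return np.repeat(x, 2)   # 2x nearest-neighbour upsampling

def h(feats):
    # Stand-in for a convolution block: average the incoming
    # feature maps; enough to demonstrate the wiring.
    return np.mean(np.stack(feats), axis=0)

def nested_unet(inp, depth=3):
    X = {}
    # Encoder backbone column: X[i][0]
    X[(0, 0)] = h([inp])
    for i in range(1, depth):
        X[(i, 0)] = h([down(X[(i - 1, 0)])])
    # Nested decoder nodes: X[i][j] receives every X[i][k], k < j,
    # plus the upsampled node X[i+1][j-1] from the level below.
    for j in range(1, depth):
        for i in range(depth - j):
            skips = [X[(i, k)] for k in range(j)]
            X[(i, j)] = h(skips + [up(X[(i + 1, j - 1)])])
    return X[(0, depth - 1)]  # finest-resolution output node

out = nested_unet(np.arange(8.0))
print(out.shape)  # (8,)
```

Compared with a plain U-Net, where each decoder level sees only one skip connection, every intermediate node here re-uses all same-resolution predecessors, which is the property the paper relies on to extract more information from a small training corpus.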