Computer Science ›› 2023, Vol. 50 ›› Issue (8): 118-124. doi: 10.11896/jsjkx.220600045

• Computer Graphics & Multimedia •


Vietnamese Speech Synthesis Based on Transfer Learning

YANG Lin1, YANG Jian1, CAI Haoran1, LIU Cong2   

  1 School of Information Science and Engineering, Yunnan University, Kunming 650504, China
    2 AI Research Institute, iFLYTEK Co., Ltd., Hefei 230088, China
  • Received: 2022-06-06  Revised: 2023-02-07  Online: 2023-08-15  Published: 2023-08-02
  • Corresponding author: YANG Jian (jianyang@ynu.edu.cn)
  • About author: YANG Lin, born in 1999, postgraduate (yun20yl@mail.ynu.edu.cn). Her main research interests include speech synthesis, recognition and understanding.
    YANG Jian, born in 1964, Ph.D., professor. His main research interests include speech synthesis, recognition and understanding.
  • Supported by:
    National Key R&D Program of China (2020AAA0107901).


Abstract: Vietnamese is the official language of the Socialist Republic of Vietnam and belongs to the Vietic sub-branch of the Viet-Muong group of the Austroasiatic language family. In recent years, deep learning-based speech synthesis has become able to produce high-quality speech, but such methods usually depend on large amounts of high-quality training speech. An effective way to address the shortage of training data for some low-resource, less widely used languages is to adopt transfer learning and borrow speech data from other high-resource, widely used languages. Under low-resource conditions, and with the goal of improving the quality of Vietnamese speech synthesis, this paper takes the end-to-end speech synthesis model Tacotron2 as the baseline and uses transfer learning to study how different source languages, text character embedding schemes, and transfer learning strategies affect the synthesized speech. The speech produced by the various models described in the paper is then evaluated both subjectively and objectively. Experimental results show that the transfer learning system based on English phoneme embedding + Vietnamese phoneme embedding achieves good results in synthesizing natural and intelligible Vietnamese speech: its synthesized speech reaches a MOS of 4.11, far higher than the baseline system's 2.53.
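
To make the text-embedding transfer idea above more concrete, the following minimal Python sketch (assuming PyTorch) shows one plausible realization: a Tacotron2-style acoustic model pretrained on an English phoneme inventory is reused for Vietnamese by enlarging its input embedding table to a merged English + Vietnamese phoneme set, copying the pretrained rows for the English symbols, and leaving the new Vietnamese rows to be learned during fine-tuning on the small Vietnamese corpus. The phoneme lists, toy model structure, and function names are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only: a toy Tacotron2-like model whose symbol embedding
    # table is transferred from an English (source) model to a merged
    # English + Vietnamese symbol set before fine-tuning on Vietnamese data.
    import torch
    import torch.nn as nn

    # Hypothetical phoneme inventories; the real ones depend on the text front-end used.
    ENGLISH_PHONEMES = ["AA", "AE", "AH", "B", "D", "IY", "K", "S", "T"]
    VIETNAMESE_PHONEMES = ["a1", "a2", "e1", "o3", "ng", "nh", "t_d"]

    class TinyTacotronLikeModel(nn.Module):
        """Stand-in for a Tacotron2-style acoustic model (embedding + encoder only)."""
        def __init__(self, n_symbols: int, emb_dim: int = 512):
            super().__init__()
            self.embedding = nn.Embedding(n_symbols, emb_dim)
            self.encoder = nn.LSTM(emb_dim, 256, batch_first=True)

        def forward(self, symbol_ids: torch.Tensor) -> torch.Tensor:
            out, _ = self.encoder(self.embedding(symbol_ids))
            return out

    def transfer_to_vietnamese(source_model: TinyTacotronLikeModel) -> TinyTacotronLikeModel:
        """Copy source-language weights and grow the embedding table for Vietnamese."""
        merged = ENGLISH_PHONEMES + VIETNAMESE_PHONEMES
        target = TinyTacotronLikeModel(n_symbols=len(merged))
        # Reuse every pretrained parameter except the (differently sized) embedding table.
        shared = {k: v for k, v in source_model.state_dict().items()
                  if not k.startswith("embedding")}
        target.load_state_dict(shared, strict=False)
        # Copy pretrained rows for the shared English symbols; the new Vietnamese rows
        # keep their random initialization and are learned during fine-tuning.
        with torch.no_grad():
            target.embedding.weight[:len(ENGLISH_PHONEMES)] = source_model.embedding.weight
        return target

    if __name__ == "__main__":
        pretrained = TinyTacotronLikeModel(n_symbols=len(ENGLISH_PHONEMES))  # "English" model
        vi_model = transfer_to_vietnamese(pretrained)
        # Fine-tuning would then run the usual training loop on the small Vietnamese corpus.
        dummy_batch = torch.randint(0, len(ENGLISH_PHONEMES) + len(VIETNAMESE_PHONEMES), (2, 7))
        print(vi_model(dummy_batch).shape)  # torch.Size([2, 7, 256])

Whether the copied English rows (and the rest of the network) are frozen or further fine-tuned is one of the transfer-learning choices the abstract refers to; the sketch above leaves everything trainable.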

Key words: Vietnamese, Speech synthesis, Transfer learning, Text embedding, End-to-end

CLC Number: TP391