Computer Science ›› 2023, Vol. 50 ›› Issue (8): 118-124.doi: 10.11896/jsjkx.220600045

• Computer Graphics & Multimedia •

Vietnamese Speech Synthesis Based on Transfer Learning

YANG Lin1, YANG Jian1, CAI Haoran1, LIU Cong2   

  1 School of Information Science and Engineering, Yunnan University, Kunming 650504, China
    2 AI Research Institute, iFLYTEK Co., Ltd., Hefei 230088, China
  • Received:2022-06-06 Revised:2023-02-07 Online:2023-08-15 Published:2023-08-02
  • About author: YANG Lin, born in 1999, postgraduate. Her main research interests are speech synthesis, recognition and understanding.
    YANG Jian, born in 1964, Ph.D, professor. His main research interests are speech synthesis, recognition and understanding.
  • Supported by:
    National Key R&D Program of China (2020AAA0107901).

Abstract: Vietnamese is the official language of the Socialist Republic of Vietnam. It belongs to the Viet branch of the Viet-Muong group of the Austroasiatic language family. In recent years, deep learning based speech synthesis has become able to produce high-quality speech, but such methods usually rely on large amounts of high-quality training speech. For low-resource, less widely spoken languages that lack sufficient training data, an effective remedy is transfer learning, which borrows speech data from high-resource, widely spoken languages. Aiming to improve the quality of Vietnamese speech synthesis under low-resource conditions, this paper takes the end-to-end speech synthesis model Tacotron2 as the baseline and uses transfer learning to study how different source languages, different text character embedding methods and different transfer strategies affect the synthesized speech. The speech synthesized by the models described in this paper is then evaluated both subjectively and objectively. Experimental results show that the transfer learning system combining English phoneme embedding with Vietnamese phonological embedding synthesizes natural and intelligible Vietnamese speech, reaching a MOS of 4.11, far higher than the 2.53 of the baseline system.
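
A minimal sketch of the embedding-transfer step described above is given below (PyTorch; not the authors' implementation). It assumes a symbol-embedding table learned on the high-resource source language (English) and warm-starts the Vietnamese table from it: rows for symbols shared by both inventories are copied, Vietnamese-only symbols keep a random initialization, and the model is then fine-tuned on the small Vietnamese corpus. The symbol inventories, the 512-dimensional embedding size and the pretrained tensor are placeholders.

# Sketch: warm-start a Vietnamese symbol-embedding table from an English one
# before fine-tuning a Tacotron2-style model (all names and sizes are illustrative).
import torch
import torch.nn as nn

embedding_dim = 512                                  # typical Tacotron2 symbol-embedding size
en_symbols = ["a", "b", "k", "m", "n"]               # placeholder English phoneme inventory
vi_symbols = ["a", "b", "k", "m", "n", "ng", "nh"]   # placeholder Vietnamese inventory

# Stand-in for the embedding weights of a model pretrained on the source language.
pretrained_weights = torch.randn(len(en_symbols), embedding_dim)

vi_embedding = nn.Embedding(len(vi_symbols), embedding_dim)
with torch.no_grad():
    for i, sym in enumerate(vi_symbols):
        if sym in en_symbols:                        # shared symbol: reuse the learned vector
            vi_embedding.weight[i] = pretrained_weights[en_symbols.index(sym)]
        # symbols unique to Vietnamese keep their random initialization

# The warm-started embedding layer and the rest of the pretrained acoustic model
# are then fine-tuned on the low-resource Vietnamese data.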

Key words: Vietnamese, Speech synthesis, Transfer learning, Text embedding, End-to-end

CLC Number: TP391