Computer Science ›› 2023, Vol. 50 ›› Issue (8): 118-124. doi: 10.11896/jsjkx.220600045
杨琳1, 杨鉴1, 蔡浩然1, 刘聪2
YANG Lin1, YANG Jian1, CAI Haoran1, LIU Cong2
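Abstract: Vietnamese is the official language of the Socialist Republic of Vietnam and belongs to the Vietnamese branch of the Viet-Muong group in the Austroasiatic language family. In recent years, deep-learning-based speech synthesis has become able to produce high-quality speech, but such methods usually depend on large amounts of high-quality speech training data. One effective way to address the shortage of speech training data for low-resource, less commonly used languages is to apply transfer learning and borrow speech data from high-resource, widely used languages. With the goal of improving Vietnamese speech synthesis quality under low-resource conditions, this paper adopts the end-to-end speech synthesis model Tacotron2 as the baseline and uses transfer learning to study how different source languages, different text character embedding schemes, and different transfer learning schemes affect synthesis quality. The speech synthesized by each model described in the paper is then evaluated both subjectively and objectively. Experimental results show that the transfer learning system based on English phoneme embedding plus Vietnamese phoneme embedding achieves good results in synthesizing natural and intelligible Vietnamese speech: the synthesized speech reaches a MOS of 4.11, far above the baseline system's 2.53.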