Computer Science (计算机科学) ›› 2023, Vol. 50 ›› Issue (6A): 220800127-5. DOI: 10.11896/jsjkx.220800127
CAI Haoran1, YANG Jian1, YANG Lin1, LIU Cong2
Abstract: Thai, spoken by tens of millions of people, is in wide use, and research on Thai speech synthesis dates back to the late 1990s. In recent years, end-to-end speech synthesis systems based on deep neural networks and trained on large-scale, high-quality text-audio data have become capable of producing high-quality speech. Widely spoken languages such as Chinese and English now have massive speech synthesis corpora, whereas for Thai, a less commonly taught language, the available text-audio corpora are typically small. Aiming to improve Thai speech synthesis quality under low-resource conditions, this work adopts the end-to-end model Tacotron2 as the baseline, studies an alternating training method and a pre-training method, and investigates how different text embedding schemes affect Thai synthesis quality. The speech synthesized by the six models designed in this paper is then evaluated using attention alignment plots and MOS scores. Experimental results show that the system combining "consonant-vowel embedding + pre-training + alternating training" achieves the best synthesis quality, with a MOS of 3.95, clearly surpassing the baseline system's 1.71.
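The alternating training method mentioned above interleaves updates on a large high-resource corpus with updates on the small Thai corpus. The abstract does not specify the schedule, so the following is a minimal sketch under an assumed 1:1 interleaving, with the small target set cycled so it is revisited many times per source epoch; all names (`alternating_batches`, `"en*"`, `"th*"`) are illustrative, not the authors' code.

```python
from itertools import cycle

def alternating_batches(source_batches, target_batches, ratio=1):
    """Build a training schedule that interleaves batches from a large
    source-language corpus with batches from a small target (Thai) corpus:
    after every `ratio` source batches, one target batch is inserted.
    The small target set is cycled so one pass over the source data
    exposes the model to the target data repeatedly."""
    target_iter = cycle(target_batches)
    schedule = []
    for i, src in enumerate(source_batches, start=1):
        schedule.append(("source", src))
        if i % ratio == 0:
            schedule.append(("target", next(target_iter)))
    return schedule

# Example: 6 source batches, 2 target batches, alternating 1:1.
sched = alternating_batches([f"en{i}" for i in range(6)],
                            [f"th{i}" for i in range(2)])
# → [('source', 'en0'), ('target', 'th0'), ('source', 'en1'), ('target', 'th1'), ...]
```

In an actual run, each scheduled batch would drive one optimizer step on the Tacotron2 model; a pre-training variant would instead first train to convergence on the source corpus before fine-tuning on Thai.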