Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220800127-5. doi: 10.11896/jsjkx.220800127

• Image Processing & Multimedia Technology •

  • Corresponding author: YANG Jian (jianyang@ynu.edu.cn)
  • First author: CAI Haoran (chr164663553@163.com)

Low-resource Thai Speech Synthesis Based on Alternate Training and Pre-training

CAI Haoran1, YANG Jian1, YANG Lin1, LIU Cong2   

  1 School of Information Science & Engineering, Yunnan University, Kunming 650504, China;
    2 AI Research Institute, iFLYTEK Co., Ltd., Hefei 230088, China
  • Online: 2023-06-10  Published: 2023-06-12
  • About author: CAI Haoran, born in 1997, postgraduate. His main research interests include speech synthesis, recognition and understanding. YANG Jian, born in 1964, Ph.D, professor. His main research interests include speech synthesis, recognition and understanding.
  • Supported by:
    National Key Research and Development Program of China (2020AAA0107901).


Abstract: As a language spoken by tens of millions of people, Thai is widely used, and research on Thai speech synthesis dates back to the late 1990s. In recent years, end-to-end speech synthesis systems based on deep neural networks and trained on large-scale, high-quality "text-audio" data have been able to synthesize high-quality speech. While common languages such as Chinese and English now have massive speech synthesis databases, the "text-audio" databases available for a non-common language such as Thai are usually small. Aiming to improve the quality of Thai speech synthesis under this low-resource condition, this paper takes the end-to-end speech synthesis model Tacotron2 as the baseline, studies an alternate training method and a pre-training method, and investigates the effect of different text-embedding schemes on Thai synthesis quality. The speech synthesized by the six models designed in this paper is then evaluated in terms of attention alignment maps and MOS scores. Experimental results show that the system using the "vowel-consonant embedding + pre-training + alternate training" method achieves the best synthesis quality: its synthesized speech reaches a MOS of 3.95, clearly better than the baseline system's 1.71.

Key words: Speech synthesis, Thai, Low resource, Alternate training, Pre-training
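The "pre-training + alternate training" recipe named in the abstract amounts to first initializing the model on a high-resource corpus, then interleaving optimizer updates drawn from the high-resource corpus and the small Thai corpus. A minimal sketch of such an alternating schedule is given below; all names (corpus labels, batch contents, the 1:1 step ratio) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of an alternate-training batch schedule:
# interleave batches from a high-resource corpus with batches from
# the low-resource Thai corpus, cycling each corpus as needed.
from itertools import cycle

def alternate_schedule(high_resource_batches, low_resource_batches, steps):
    """Yield (corpus_name, batch) pairs, alternating between the
    high-resource corpus (even steps) and the Thai corpus (odd steps)."""
    high = cycle(high_resource_batches)
    low = cycle(low_resource_batches)
    for step in range(steps):
        if step % 2 == 0:
            yield ("high", next(high))
        else:
            yield ("thai", next(low))

# Example: two high-resource batches, one Thai batch, six training steps.
plan = list(alternate_schedule(["en_b0", "en_b1"], ["th_b0"], 6))
```

In a real training loop, each yielded batch would be fed to one optimizer step of the Tacotron2 model; the schedule merely controls which corpus supplies each step.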

CLC number: TP391