Low-resource Thai Speech Synthesis Based on Alternate Training and Pre-training

doi:10.11896/jsjkx.220800127

Abstract: Thai is spoken by tens of millions of people and is widely used. Research on Thai speech synthesis began in the late 1990s. In recent years, end-to-end speech synthesis systems based on deep neural networks and trained on large-scale, high-quality “text-audio” data have been able to synthesize high-quality speech. Chinese, English and other common languages now have massive speech synthesis databases, whereas the “text-audio” databases available for Thai, a less commonly used language, are often small. To improve the quality of Thai speech synthesis under low-resource conditions, this paper takes the end-to-end speech synthesis model Tacotron2 as the baseline, studies an alternate training method and a pre-training method, and examines how different text embedding methods affect synthesis quality. The speech synthesized by the six models designed in this paper is then evaluated using attention alignment maps and mean opinion score (MOS). Experimental results show that the system combining “vowel consonant embedding + pre-training + alternate training” achieves the best synthesis quality, with a MOS of 3.95, significantly better than the baseline system’s 1.71.

Key words: Speech synthesis, Thai, Low resource, Alternate training, Pre-training

CLC Number: TP391

CAI Haoran, YANG Jian, YANG Lin, LIU Cong. Low-resource Thai Speech Synthesis Based on Alternate Training and Pre-training [J]. Computer Science, 2023, 50(6A): 220800127-5.
