Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230500174-7. doi: 10.11896/jsjkx.230500174

• Image Processing & Multimedia Technology •

Thai Speech Synthesis Based on Cross-language Transfer Learning and Joint Training

ZHANG Xinrui, YANG Jian, WANG Zhan   

  1. School of Information Science & Engineering,Yunnan University,Kunming 650504,China
  • Published:2024-06-06
  • About author:ZHANG Xinrui,born in 1999,postgraduate.His main research interests include speech synthesis,recognition and understanding.
    YANG Jian,born in 1964,Ph.D,professor.His main research interests include speech synthesis,recognition and understanding.
  • Supported by:
    National Key Research and Development Program of China(2020AAA0107901) and National Natural Science Foundation of China(61961043).

Abstract: With the rapid development of deep learning and neural networks, end-to-end speech synthesis systems based on deep neural networks have become mainstream owing to their excellent performance. Research on Thai speech synthesis, however, remains limited, mainly because large-scale Thai datasets are scarce and the language has an unusual spelling system. This paper studies Thai speech synthesis based on the FastSpeech2 acoustic model and the StyleMelGAN vocoder under a low-resource setting. To address the problems of the baseline system, three improvements are proposed to further raise the quality of synthesized Thai speech. (1) Under the guidance of Thai language experts and drawing on Thai linguistics, a Thai G2P model is designed to handle the special spelling of Thai text. (2) Based on the International Phonetic Alphabet phonemes produced by the designed Thai G2P model, languages with similar phoneme input units and abundant data are selected for cross-language transfer learning, which alleviates the shortage of Thai training data. (3) FastSpeech2 and the StyleMelGAN vocoder are trained jointly to resolve the mismatch in acoustic features between the two models. To verify the effectiveness of the proposed methods, attention alignment maps, the objective metric MCD, and subjective MOS scores are evaluated. Experimental results show that the designed Thai G2P model yields better alignment and thus more accurate phoneme durations, and that the system combining the designed Thai G2P model, joint training, and transfer learning achieves the best synthesis quality: its synthesized speech scores 7.43±0.82 (MCD) and 4.53 (MOS), significantly better than the baseline system's 9.47±0.54 and 1.14.
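For reference, the objective metric used above, mel-cepstral distortion (MCD, after Kubichek [23]), is conventionally computed per frame as (10/ln 10)·sqrt(2·Σ_d (c_d − c'_d)²) over the mel-cepstral coefficients (excluding the 0th, energy term) and averaged over frames. The Python sketch below is a minimal illustration of this computation, assuming the reference and synthesized utterances have already been converted to mel-cepstral sequences and time-aligned to equal length (e.g., by dynamic time warping); the function name and array layout are illustrative, not taken from the paper.

    import numpy as np

    def mel_cepstral_distortion(mc_ref, mc_syn):
        # mc_ref, mc_syn: aligned mel-cepstral sequences of shape
        # (frames, dims); frame counts must already match.
        # The 0th (energy) coefficient is conventionally excluded.
        diff = mc_ref[:, 1:] - mc_syn[:, 1:]
        # Per-frame MCD in dB: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
        frame_mcd = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return float(np.mean(frame_mcd))

Lower MCD means the synthesized spectra are closer to the reference recording, which is why the drop from 9.47±0.54 to 7.43±0.82 reported above indicates an improvement.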

Key words: Speech synthesis, Low resource, Thai G2P model, Transfer learning, Joint training

CLC Number: TP391
[1]WANG Y,SKERRY-RYAN R J,STANTON D,et al.Tacotron:Towards end-to-end speech synthesis[J].arXiv:1703.10135,2017.
[2]SHEN J,PANG R,WEISS R J,et al.Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2018:4779-4783.
[3]REN Y,RUAN Y,TAN X,et al.FastSpeech:Fast,robust and controllable text to speech[C]//Proceedings of the 33rd International Conference on Advances in Neural Information Processing Systems.2019:3171-3180.
[4]REN Y,HU C,TAN X,et al.FastSpeech 2:Fast and high-quality end-to-end text to speech[J].arXiv:2006.04558,2020.
[5]CHOMPHAN S,KOBAYASHI T.Implementation and evaluation of an HMM-based Thai speech synthesis system[C]//Eighth Annual Conference of the International Speech Communication Association.2007.
[6]TESPRASIT V,CHAROENPORNSAWAT P,SORNLERTLAMVANICH V.A context-sensitive homograph disambiguation in Thai text-to-speech synthesis[C]//Companion Volume of the Proceedings of HLT-NAACL 2003-Short Papers.2003:103-105.
[7]WAN V,LATORRE J,CHIN K K,et al.Combining multiple high quality corpora for improving HMM-TTS[C]//Thirteenth Annual Conference of the International Speech Communication Association.2012.
[8]OORD A,DIELEMAN S,ZEN H,et al.WaveNet:A generative model for raw audio[J].arXiv:1609.03499,2016.
[9]PRENGER R,VALLE R,CATANZARO B.WaveGlow:A flow-based generative network for speech synthesis[C]//2019 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2019).IEEE,2019:3617-3621.
[10]KUMAR K,KUMAR R,DE BOISSIERE T,et al.MelGAN:Generative adversarial networks for conditional waveform synthesis[J].arXiv:1910.06711,2019.
[11]YAMAMOTO R,SONG E,KIM J M.Parallel WaveGAN:A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:6199-6203.
[12]MUSTAFA A,PIA N,FUCHS G.StyleMelGAN:An efficient high-fidelity adversarial vocoder with temporal adaptive normalization[C]//2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2021).IEEE,2021:6034-6038.
[13]PARK T,LIU M Y,WANG T C,et al.Semantic image synthesis with spatially-adaptive normalization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:2337-2346.
[14]NGUYEN T Q.Near-perfect-reconstruction pseudo-QMF banks[J].IEEE Transactions on Signal Processing,1994,42(1):65-76.
[15]QIN Y Y.Analysis of Thai phonetics teaching and teaching strategies for Chinese students in the primary stage[D].Nanning:Guangxi University,2017.
[16]LIU J,XIE Z,ZHANG C,et al.A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2[J].International Journal of Machine Learning and Cybernetics,2021,12:2809-2823.
[17]SEEHA S,BILAN I,SANCHEZ L M,et al.ThaiLMCut:Unsupervised pretraining for Thai word segmentation[C]//Proceedings of the 12th Language Resources and Evaluation Conference.2020:6947-6957.
[18]FAHMY F K,KHALIL M I,ABBAS H M.A transfer learning end-to-end Arabic text-to-speech(TTS) deep architecture[C]//Artificial Neural Networks in Pattern Recognition:9th IAPR TC3 Workshop(ANNPR 2020).Winterthur,Switzerland.Cham:Springer International Publishing,2020:266-277.
[19]XU J,TAN X,REN Y,et al.LRSpeech:Extremely low-resource speech synthesis and recognition[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2020:2802-2812.
[20]HAYASHI T,YAMAMOTO R,YOSHIMURA T,et al.ESPnet2-TTS:Extending the edge of TTS research[J].arXiv:2110.07840,2021.
[21]KONG J,KIM J,BAE J.HiFi-GAN:Generative adversarial networks for efficient and high fidelity speech synthesis[J].Advances in Neural Information Processing Systems,2020,33:17022-17033.
[22]WATANABE S,HORI T,KARITA S,et al.ESPnet:End-to-end speech processing toolkit[J].arXiv:1804.00015,2018.
[23]KUBICHEK R.Mel-cepstral distance measure for objective speech quality assessment[C]//Proceedings of IEEE Pacific Rim Conference on Communications,Computers and Signal Processing.IEEE,1993:125-128.