Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230500174-7. doi: 10.11896/jsjkx.230500174

• Image Processing & Multimedia Technology •


Thai Speech Synthesis Based on Cross-language Transfer Learning and Joint Training

ZHANG Xinrui, YANG Jian, WANG Zhan   

  1. School of Information Science & Engineering,Yunnan University,Kunming 650504,China
  • Published: 2024-06-06
  • Corresponding author: YANG Jian (jianyang@ynu.edu.cn)
  • About author: ZHANG Xinrui, born in 1999, postgraduate (1090525272@qq.com). His main research interests include speech synthesis, recognition and understanding.
    YANG Jian, born in 1964, Ph.D., professor. His main research interests include speech synthesis, recognition and understanding.
  • Supported by:
    National Key Research and Development Program of China(2020AAA0107901) and National Natural Science Foundation of China(61961043).


Abstract: With the rapid development of deep learning and neural networks, end-to-end speech synthesis systems based on deep neural networks have become mainstream owing to their excellent performance. Research on Thai speech synthesis, however, remains insufficient, mainly because large-scale Thai datasets are scarce and the language has distinctive spelling conventions. This paper studies Thai speech synthesis under low-resource conditions, based on the FastSpeech2 acoustic model and the StyleMelGAN vocoder. To address the problems of the baseline system, three improvements are proposed to further raise the quality of synthesized Thai speech. (1) Under the guidance of Thai language experts and drawing on Thai linguistics, a Thai G2P (grapheme-to-phoneme) model is designed to handle the special spelling patterns in Thai text. (2) Based on the IPA phonemes produced by the proposed Thai G2P model, a language with similar phoneme input units and abundant data is selected for cross-language transfer learning, which alleviates the shortage of Thai training data. (3) FastSpeech2 and the StyleMelGAN vocoder are trained jointly to resolve the acoustic feature mismatch between them. To verify the effectiveness of the proposed methods, the systems are evaluated with attention alignment maps, the objective MCD metric, and subjective MOS scores. Experimental results show that the proposed Thai G2P model yields better alignment and therefore more accurate phoneme durations, and that the system combining the proposed Thai G2P model, joint training, and transfer learning achieves the best synthesis quality, with an MCD of 7.43±0.82 and a MOS of 4.53, clearly better than the baseline system's 9.47±0.54 and 1.14.
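For context on the objective evaluation above, mel-cepstral distortion (MCD, ref. [23]) is conventionally computed per frame as (10/ln 10)·sqrt(2·Σ_d (c_d − c′_d)²) over the mel-cepstral coefficients and then averaged across time-aligned frames. The NumPy sketch below only illustrates this convention; the function name and array layout are assumptions rather than details from the paper, and it assumes the 0th (energy) coefficient has been dropped and the two sequences have already been aligned (e.g. by DTW).

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Frame-averaged MCD in dB between two time-aligned mel-cepstral sequences.

    Both arrays have shape (num_frames, num_coeffs); the 0th (energy)
    coefficient is assumed to have been removed beforehand.
    """
    assert ref_mcep.shape == syn_mcep.shape, "align the sequences first (e.g. by DTW)"
    diff = ref_mcep - syn_mcep
    # per-frame MCD: 10 / ln(10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```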

Key words: Speech synthesis, Low resource, Thai G2P model, Transfer learning, Joint training
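To make the joint-training idea (improvement 3) more concrete, the sketch below shows one common way an acoustic model and a GAN vocoder can be optimized together: the vocoder is driven by the acoustic model's predicted mel-spectrograms rather than ground-truth ones, so the features it sees during training match those it will see at synthesis time. This is only an illustrative sketch; the module names, the LSGAN-style adversarial term, and the weight lambda_adv are assumptions and not the authors' exact formulation, and discriminator updates and optimizer steps are omitted.

```python
import torch

def joint_training_step(acoustic_model, vocoder, discriminator, batch, lambda_adv=4.0):
    """One illustrative generator-side step of joint acoustic-model/vocoder training.

    acoustic_model : phonemes -> (predicted mel-spectrogram, acoustic losses),
                     e.g. a FastSpeech2-style model with mel/duration/pitch/energy losses
    vocoder        : mel-spectrogram -> waveform (StyleMelGAN-like generator)
    discriminator  : waveform -> real/fake scores
    """
    # 1. Acoustic model forward pass: predicted mels plus its own training losses.
    mel_pred, acoustic_loss = acoustic_model(batch["phonemes"], batch["targets"])

    # 2. Feed the *predicted* mels to the vocoder, so the vocoder is trained on the
    #    same kind of features it will receive at synthesis time (reduces mismatch).
    wav_fake = vocoder(mel_pred)

    # 3. Adversarial loss on the generated waveform (LSGAN-style, illustrative).
    adv_loss = torch.mean((1.0 - discriminator(wav_fake)) ** 2)

    total_loss = acoustic_loss + lambda_adv * adv_loss
    total_loss.backward()
    return total_loss
```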

CLC number: TP391

References:
[1]WANG Y,SKERRY-RYAN R J,STANTON D,et al.Tacotron:Towards end-to-end speech synthesis[J].arXiv:1703.10135,2017.
[2]SHEN J,PANG R,WEISS R J,et al.Natural tts synthesis by conditioning wavenet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2018:4779-4783.
[3]REN Y,RUAN Y,TAN X,et al.Fastspeech:Fast,robust and controllable text to speech[C]//Proceedings of the 33rd International Conference on Advances in Neural Information Processing Systems.2019:3171-3180.
[4]REN Y,HU C,TAN X,et al.Fastspeech 2:Fast and high-quality end-to-end text to speech[J].arXiv:2006.04558,2020.
[5]CHOMPHAN S,KOBAYASHI T.Implementation and evaluation of an HMM-based Thai speech synthesis system[C]//Eighth Annual Conference of the International Speech Communication Association.2007.
[6]TESPRASIT V,CHAROENPORNSAWAT P,SORNLERTLAMVANICH V.A context-sensitive homograph disambiguation in Thai text-to-speech synthesis[C]//Companion Volume of the Proceedings of HLT-NAACL 2003-Short Papers.2003:103-105.
[7]WAN V,LATORRE J,CHIN K K,et al.Combining multiple high quality corpora for improving HMM-TTS[C]//Thirteenth Annual Conference of the International Speech Communication Association.2012.
[8]OORD A,DIELEMAN S,ZEN H,et al.Wavenet:A generative model for raw audio[J].arXiv:1609.03499,2016.
[9]PRENGER R,VALLE R,CATANZARO B.Waveglow:A flow-based generative network for speech synthesis[C]//2019 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2019).IEEE,2019:3617-3621.
[10]KUMAR K,KUMAR R,DE BOISSIERE T,et al.Melgan:Generative adversarial networks for conditional waveform synthesis[J].arXiv:1910.06711,2019.
[11]YAMAMOTO R,SONG E,KIM J M.Parallel WaveGAN:A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:6199-6203.
[12]MUSTAFA A,PIA N,FUCHS G.Stylemelgan:An efficient high-fidelity adversarial vocoder with temporal adaptive normalization[C]//2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2021).IEEE,2021:6034-6038.
[13]PARK T,LIU M Y,WANG T C,et al.Semantic image synthesis with spatially-adaptive normalization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:2337-2346.
[14]NGUYEN T Q.Near-perfect-reconstruction pseudo-QMF banks[J].IEEE Transactions on Signal Processing,1994,42(1):65-76.
[15]QIN Y Y.Analysis of Thai phonetics teaching and teaching strategies for Chinese students in the primary stage[D].Nanning:Guangxi University,2017.
[16]LIU J,XIE Z,ZHANG C,et al.A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2[J].International Journal of Machine Learning and Cybernetics,2021,12:2809-2823.
[17]SEEHA S,BILAN I,SANCHEZ L M,et al.Thailmcut:Unsupervised pretraining for Thai word segmentation[C]//Proceedings of The 12th Language Resources and Evaluation Conference.2020:6947-6957.
[18]FAHMY F K,KHALIL M I,ABBAS H M.A transfer learning end-to-end arabic text-to-speech(tts) deep architecture[C]//Artificial Neural Networks in Pattern Recognition:9th IAPR TC3 Workshop(ANNPR 2020).Winterthur,Switzerland,Cham:Springer International Publishing,2020:266-277.
[19]XU J,TAN X,REN Y,et al.Lrspeech:Extremely low-resource speech synthesis and recognition[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2020:2802-2812.
[20]HAYASHI T,YAMAMOTO R,YOSHIMURA T,et al.Espnet2-tts:Extending the edge of tts research[J].arXiv:2110.07840,2021.
[21]KONG J,KIM J,BAE J.Hifi-gan:Generative adversarial networks for efficient and high fidelity speech synthesis[J].Advances in Neural Information Processing Systems,2020,33:17022-17033.
[22]WATANABE S,HORI T,KARITA S,et al.Espnet:End-to-end speech processing toolkit[J].arXiv:1804.00015,2018.
[23]KUBICHEK R.Mel-cepstral distance measure for objective speech quality assessment[C]//Proceedings of IEEE Pacific Rim Conference on Communications,Computers and Signal Processing.IEEE,1993:125-128.