Computer Science, 2024, Vol. 51, Issue (6A): 230500174-7. doi: 10.11896/jsjkx.230500174
ZHANG Xinrui, YANG Jian, WANG Zhan
Abstract: With the rapid development of deep learning and neural networks, end-to-end speech synthesis systems built on deep neural networks have become mainstream owing to their excellent performance. Research on Thai speech synthesis, however, remains limited, mainly because large-scale Thai datasets are scarce and the language's spelling conventions are idiosyncratic. This paper therefore studies Thai speech synthesis under low-resource conditions, based on the FastSpeech2 acoustic model and the StyleMelGAN vocoder. To address shortcomings of the baseline system, three improvements are proposed to further raise the quality of synthesized Thai speech. (1) Guided by Thai language experts and informed by Thai linguistics, a Thai G2P (grapheme-to-phoneme) model is designed to handle the special spelling conventions found in Thai text. (2) Using the IPA phonemes produced by the proposed Thai G2P model, a language with similar phoneme input units and abundant data is selected for cross-lingual transfer learning, alleviating the shortage of Thai training data. (3) FastSpeech2 and the StyleMelGAN vocoder are trained jointly to resolve the mismatch in acoustic features between them. The proposed methods are evaluated from three perspectives: attention alignment plots, the objective metric MCD, and subjective MOS scores. Experimental results show that the proposed Thai G2P model yields better alignment and hence more accurate phoneme durations, and that the system combining the proposed Thai G2P model, joint training, and transfer learning achieves the best synthesis quality, with an MCD of 7.43±0.82 and a MOS of 4.53, clearly outperforming the baseline system's 9.47±0.54 and 1.14.
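To make the G2P step concrete, below is a toy Python sketch (not the authors' model) of one spelling peculiarity that any Thai G2P stage must handle: preposed vowels such as เ /eː/ are written before the consonant they phonetically follow, so graphemes must be reordered before phonemes are emitted. The mapping tables are tiny hypothetical fragments for illustration, not a complete Thai-to-IPA inventory.

```python
# Toy illustration of Thai preposed-vowel reordering in G2P.
# Hypothetical fragment; a real Thai G2P model covers the full
# consonant/vowel/tone inventory and many more spelling rules.
PREPOSED_VOWELS = {"เ": "eː", "แ": "ɛː", "โ": "oː", "ไ": "aj", "ใ": "aj"}
CONSONANTS = {"ม": "m", "ก": "k", "ท": "tʰ", "ร": "r", "น": "n"}

def toy_thai_g2p(text: str) -> list[str]:
    phonemes, pending_vowel = [], None
    for ch in text:
        if ch in PREPOSED_VOWELS:
            # Vowel is written first but pronounced after the onset:
            # hold it until the consonant it belongs to is seen.
            pending_vowel = PREPOSED_VOWELS[ch]
        elif ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
            if pending_vowel:
                phonemes.append(pending_vowel)  # emit vowel after onset
                pending_vowel = None
    return phonemes

# e.g. toy_thai_g2p("เม") -> ["m", "eː"]: grapheme order เ+ม,
# phoneme order /m/ + /eː/.
```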
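For reference, the objective metric MCD (mel-cepstral distortion) used in the evaluation is conventionally computed as (10/ln 10)·sqrt(2·Σ_d (c_d − ĉ_d)²), averaged over aligned frames. Below is a minimal sketch assuming two already time-aligned (e.g. by DTW) mel-cepstral sequences and the common convention of excluding the 0th (energy) coefficient; it is an illustration of the standard formula, not the paper's exact evaluation script.

```python
import numpy as np

# Scaling constant for MCD in dB: 10/ln(10) * sqrt(2) ≈ 6.1418.
MCD_CONST = 10.0 / np.log(10.0) * np.sqrt(2.0)

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    """MCD in dB between aligned mel-cepstra of shape (frames, coeffs)."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]            # drop c0 (energy term)
    frame_dist = np.sqrt((diff ** 2).sum(axis=1))   # per-frame Euclidean norm
    return MCD_CONST * frame_dist.mean()            # average over frames
```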