Computer Science ›› 2024, Vol. 51 ›› Issue (2): 161-171. doi: 10.11896/jsjkx.221100125
WU Kewei1,2,3, HAN Chao3, SUN Yongxuan1,2,3, PENG Menghao3, XIE Zhao1,2,3
Abstract: Speech synthesis converts the text of an input sentence into a speech signal comprising phonemes, words, and the sentence as a whole. Existing speech synthesis methods treat the sentence as a single unit and struggle to accurately synthesize speech signals of different lengths. By analyzing the hierarchical relations embedded in speech signals, this paper designs a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and proposes a speech synthesis model built on a hierarchical text-speech Conformer. First, according to the length of the input text signal, the model constructs a hierarchical text encoder with three levels: phoneme-level, word-level, and sentence-level text encoders. Encoders at different levels describe text information at different lengths, and the Conformer attention mechanism is used to learn the relations among temporal features within signals of each length. The hierarchical text encoder can identify the information to be emphasized at each length within a sentence, effectively extract text features at different lengths, and alleviate the uncertainty in the duration of the synthesized speech signal. Second, the hierarchical speech encoder likewise comprises phoneme-level, word-level, and sentence-level speech encoders. At each level, the speech encoder takes the text features as the Conformer query vectors and the speech features as the key and value vectors, so as to extract the matching relation between text features and speech features. The hierarchical speech encoder and the text-speech matching relation alleviate the inaccurate synthesis of speech signals of different lengths. The proposed hierarchical text-speech encoder can be flexibly embedded into a variety of existing decoders, providing more reliable speech synthesis through the complementarity of text and speech. Experiments on the LJSpeech and LibriTTS datasets show that the mel-cepstral distortion of the proposed method is lower than that of existing speech synthesis methods.
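The text-speech matching step described above — text features serving as the attention query, speech features as the key and value — can be sketched at one level of the hierarchy with a standard multi-head cross-attention layer. This is a minimal illustration, not the paper's implementation: the class name `TextSpeechCrossAttention`, the dimensions, and the use of `nn.MultiheadAttention` in place of a full Conformer block are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class TextSpeechCrossAttention(nn.Module):
    """One level of the hierarchical speech encoder, sketched as plain
    cross-attention: query = text features, key = value = speech features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feat: torch.Tensor, speech_feat: torch.Tensor):
        # Attention output is aligned with the text sequence; the weights
        # describe the text-to-speech matching relation.
        out, weights = self.attn(text_feat, speech_feat, speech_feat)
        return out, weights

# Toy shapes: a batch of 2, 50 text positions, 400 speech frames, dim 256.
text = torch.randn(2, 50, 256)
speech = torch.randn(2, 400, 256)
out, w = TextSpeechCrossAttention()(text, speech)
print(out.shape)  # torch.Size([2, 50, 256])
print(w.shape)    # torch.Size([2, 50, 400])
```

In the full model, one such block would be instantiated per level (phoneme, word, sentence), each receiving the text features extracted by the corresponding level of the hierarchical text encoder.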