Computer Science ›› 2024, Vol. 51 ›› Issue (2): 161-171. doi: 10.11896/jsjkx.221100125

• Computer Graphics & Multimedia •


Hierarchical Conformer Based Speech Synthesis

WU Kewei1,2,3, HAN Chao3, SUN Yongxuan1,2,3, PENG Menghao3, XIE Zhao1,2,3   

  1 Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Hefei University of Technology, Hefei 230601, China
    2 Anhui Provincial Key Laboratory of Emotional Computing and Advanced Intelligent Machine, Hefei University of Technology, Hefei 230601, China
    3 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Received: 2022-11-15 Revised: 2023-04-11 Online: 2024-02-15 Published: 2024-02-22
  • Corresponding author: XIE Zhao (xiezhao@hfut.edu.cn)
  • About author: WU Kewei (wu_kewei1984@163.com), born in 1984, Ph.D, associate researcher, is a member of CCF (No.42032M). His main research interests include speech synthesis, computer vision and deep learning. XIE Zhao, born in 1980, Ph.D, associate professor. His main research interests include computer vision, image analysis and understanding, and deep learning.
  • Supported by:
Key Research and Development Program of Anhui Province(202004d07020004), Natural Science Foundation of Anhui Province(2108085MF203) and Fundamental Research Funds for the Central Universities(PA2021GDSK0072, JZ2021HGQA0219).



Abstract: Speech synthesis requires converting the input text into a speech signal containing phonemes, words and utterances. Existing speech synthesis methods treat the utterance as a whole, making it difficult to accurately synthesize speech signals of different lengths. In this paper, we analyze the hierarchical relationships embedded in speech signals, design a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and propose a speech synthesis model based on a hierarchical text-speech Conformer. First, the model constructs hierarchical text encoders according to the length of the input text signal, with three levels: phoneme-level, word-level and utterance-level text encoders. Each level of text encoder describes text information of a different length and uses the Conformer's attention mechanism to learn the relationships between temporal features at that length. With the hierarchical text encoder, the model can find the information that needs to be emphasized at each length within an utterance, effectively extracting text features at different lengths and alleviating the uncertainty in the duration of the synthesized speech signal. Second, the hierarchical speech encoder also includes three levels: phoneme-level, word-level and utterance-level speech encoders. In each level of speech encoder, the text features are used as the query vectors of the Conformer, and the speech features are used as its key and value vectors, so as to extract the matching relationship between text features and speech features. The hierarchical speech encoder and the text-speech matching relations alleviate the inaccurate synthesis of speech signals of different lengths. The proposed hierarchical text-speech encoder can be flexibly embedded into a variety of existing decoders, providing more reliable speech synthesis results through the complementarity between text and speech. Experiments on two datasets, LJSpeech and LibriTTS, show that the Mel cepstral distortion of the proposed method is smaller than that of existing speech synthesis methods.
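
To make the mechanism concrete, the following Python (PyTorch) sketch illustrates the two attention patterns the abstract describes: self-attention within each text granularity, and cross-attention in which text features act as queries and speech features act as keys and values. This is an illustrative reconstruction, not the authors' implementation; the class names, hyperparameters (d_model=256, n_heads=4, kernel_size=31) and the one-block-per-level depth are assumptions, and the Conformer block is simplified relative to the full design of Gulati et al.

# Minimal sketch of a hierarchical text-speech Conformer as described in the
# abstract. Assumed (not from the paper): class names, d_model=256, n_heads=4,
# kernel_size=31, and a single simplified Conformer block per level.
import torch
import torch.nn as nn


class ConformerBlock(nn.Module):
    """Simplified Conformer block: half-step feed-forward, attention,
    depthwise convolution, half-step feed-forward, final layer norm.
    With key_value=None it performs self-attention (text encoder);
    otherwise cross-attention (speech encoder: text queries speech)."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.ff2 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, query, key_value=None):
        x = query + 0.5 * self.ff1(query)            # half-step feed-forward
        q = self.norm_attn(x)
        kv = q if key_value is None else key_value   # self- vs cross-attention
        attn_out, _ = self.attn(q, kv, kv)
        x = x + attn_out
        conv_out = self.conv(self.norm_conv(x).transpose(1, 2)).transpose(1, 2)
        x = x + conv_out                             # local (convolutional) context
        x = x + 0.5 * self.ff2(x)                    # half-step feed-forward
        return self.norm_out(x)


class HierarchicalTextSpeechConformer(nn.Module):
    """Three levels (phoneme, word, utterance). At each level a text
    Conformer self-attends within the text features, then a speech
    Conformer uses those text features as queries against the speech
    features (keys/values) to extract the text-speech matching."""

    def __init__(self, d_model=256, n_levels=3):
        super().__init__()
        self.text_encoders = nn.ModuleList(
            [ConformerBlock(d_model) for _ in range(n_levels)])
        self.speech_encoders = nn.ModuleList(
            [ConformerBlock(d_model) for _ in range(n_levels)])

    def forward(self, text_feats, speech_feats):
        # text_feats / speech_feats: lists of (batch, length, d_model)
        # tensors ordered phoneme -> word -> utterance.
        fused = []
        for t_enc, s_enc, t, s in zip(self.text_encoders,
                                      self.speech_encoders,
                                      text_feats, speech_feats):
            t = t_enc(t)                         # within-level self-attention
            fused.append(s_enc(t, key_value=s))  # text queries speech
        return fused  # per-level features for a downstream decoder


if __name__ == "__main__":
    model = HierarchicalTextSpeechConformer()
    text = [torch.randn(2, n, 256) for n in (120, 30, 1)]    # phoneme/word/utterance
    speech = [torch.randn(2, 400, 256) for _ in range(3)]    # frame-level features
    for level, out in enumerate(model(text, speech)):
        print(level, out.shape)  # each output keeps its text level's length

Because the cross-attention output keeps the text sequence's length at each level, a downstream decoder receives phoneme-, word- and utterance-aligned features that already encode the text-speech matching, which is what lets the hierarchical encoder pair be embedded into different existing decoders.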

Key words: Speech synthesis, Text encoder, Speech encoder, Hierarchical model, Conformer

• CLC number: TP391