Computer Science ›› 2024, Vol. 51 ›› Issue (2): 161-171. doi: 10.11896/jsjkx.221100125

• Computer Graphics & Multimedia •

Hierarchical Conformer Based Speech Synthesis

WU Kewei1,2,3, HAN Chao3, SUN Yongxuan1,2,3, PENG Menghao3, XIE Zhao1,2,3   

  1. Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Hefei University of Technology, Hefei 230601, China
  2. Anhui Provincial Key Laboratory of Emotional Computing and Advanced Intelligent Machine, Hefei University of Technology, Hefei 230601, China
  3. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Received: 2022-11-15  Revised: 2023-04-11  Online: 2024-02-15  Published: 2024-02-22
  • About author: WU Kewei, born in 1984, Ph.D, associate researcher, is a member of CCF (No.42032M). His main research interests include speech synthesis, computer vision and deep learning. XIE Zhao, born in 1980, Ph.D, associate professor. His main research interests include computer vision, image analysis and understanding, and deep learning.
  • Supported by:
    Key Research and Development Program of Anhui Province (2004d07020004), Natural Science Foundation of Anhui Province (2108085MF203) and Special Funds for Basic Scientific Research Operations of Central Universities (PA2021GDSK0072, JZ2021HGQA0219).

Abstract: Speech synthesis converts input text into a speech signal containing phonemes, words and utterances. Existing speech synthesis methods treat the utterance as a whole, which makes it difficult to accurately synthesize speech signals of different lengths. In this paper, we analyze the hierarchical relationships embedded in speech signals, design a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and propose a speech synthesis model built on this hierarchical text-speech Conformer. First, the model constructs hierarchical text encoders according to the length of the input text signal, with three levels: phoneme-level, word-level, and utterance-level text encoders. Each level of text encoder describes text information at a different length and uses the Conformer's attention mechanism to learn the relationships between temporal features within signals of that length. With the hierarchical text encoder, the model can identify the information that needs to be emphasized at different lengths within an utterance and effectively extract text features at each length, alleviating the uncertainty in the duration of the synthesized speech signal. Second, the hierarchical speech encoder likewise comprises phoneme-level, word-level, and utterance-level speech encoders. In each level of speech encoder, the text features are used as the query vectors of the Conformer, while the speech features are used as the key and value vectors, so as to extract the matching relationship between text features and speech features. The hierarchical speech encoder and these text-speech matching relations alleviate the inaccurate synthesis of speech signals of different lengths. The hierarchical text-speech encoder modeled in this paper can be flexibly embedded into a variety of existing decoders, providing more reliable speech synthesis results through the complementarity between text and speech. Experiments are conducted on two datasets, LJSpeech and LibriTTS, and the results show that the Mel-cepstral distortion of the proposed method is smaller than that of existing speech synthesis methods.
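To make the text-speech matching concrete, below is a minimal PyTorch sketch of the cross-attention pattern the abstract describes: at each level, text features act as the query while speech features supply the key and value, one block per level (phoneme, word, utterance). This is an illustration under stated assumptions, not the authors' implementation; the class names (ConformerCrossAttention, HierarchicalSpeechEncoder), the model width of 256, and the use of plain multi-head attention in place of a full Conformer block (which additionally has macaron feed-forward and convolution modules) are all assumptions.

import torch
import torch.nn as nn

class ConformerCrossAttention(nn.Module):
    # Hypothetical block: text queries attend over speech keys/values,
    # followed by a residual connection and layer normalization.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, speech_feats):
        # text_feats:   (batch, T_text, d_model)   -> query
        # speech_feats: (batch, T_speech, d_model) -> key and value
        out, _ = self.attn(text_feats, speech_feats, speech_feats)
        return self.norm(text_feats + out)

class HierarchicalSpeechEncoder(nn.Module):
    # Hypothetical wrapper: one cross-attention block per hierarchy level.
    def __init__(self, d_model=256, levels=("phoneme", "word", "utterance")):
        super().__init__()
        self.blocks = nn.ModuleDict(
            {lvl: ConformerCrossAttention(d_model) for lvl in levels})

    def forward(self, text_by_level, speech_feats):
        # text_by_level: dict mapping level name -> (batch, T_level, d_model);
        # speech_feats is shared across levels here (an assumption).
        return {lvl: blk(text_by_level[lvl], speech_feats)
                for lvl, blk in self.blocks.items()}

# Toy usage: 40 phonemes, 12 words, 1 utterance vector, 200 speech frames.
enc = HierarchicalSpeechEncoder()
text = {"phoneme": torch.randn(2, 40, 256),
        "word": torch.randn(2, 12, 256),
        "utterance": torch.randn(2, 1, 256)}
fused = enc(text, torch.randn(2, 200, 256))
print({lvl: tuple(f.shape) for lvl, f in fused.items()})

For reference, the evaluation metric mentioned in the abstract, Mel-cepstral distortion, is conventionally computed as MCD = (10/ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²), where c_d and ĉ_d are the d-th mel-cepstral coefficients of the reference and synthesized frames.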

Key words: Speech synthesis, Text encoder, Speech encoder, Hierarchical model, Conformer

CLC Number: TP391

References:
[1] AN X, DAI Z B, LI Y, et al. An end-to-end speech synthesis method based on BERT[J]. Computer Science, 2022, 49(4): 221-226.
[2] LI N, LIU S, LIU Y, et al. Neural speech synthesis with Transformer network[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019: 6706-6713.
[3] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//NIPS. 2017: 5998-6008.
[4] YANG S, LU H, KANG S, et al. On the localness modeling for the self-attention based end-to-end speech synthesis[J]. Neural Networks, 2020, 125: 121-130.
[5] REN Y, RUAN Y, TAN X, et al. FastSpeech: Fast, robust and controllable text to speech[C]//NeurIPS. 2019: 3165-3174.
[6] REN Y, HU C, TAN X, et al. FastSpeech 2: Fast and high-quality end-to-end text to speech[C]//9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
[7] ŁAŃCUCKI A. FastPitch: Parallel text-to-speech with pitch prediction[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021). IEEE, 2021: 6588-6592.
[8] GULATI A, QIN J, CHIU C C, et al. Conformer: Convolution-augmented Transformer for speech recognition[C]//21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020: 5036-5040.
[9] LIU Y, XU Z, WANG G, et al. DelightfulTTS: The Microsoft speech synthesis system for Blizzard Challenge 2021[J]. arXiv:2110.12612, 2021.
[10] DAI Z, YU J, WANG Y, et al. Automatic prosody annotation with pre-trained text-speech model[C]//23rd Annual Conference of the International Speech Communication Association. Incheon: ISCA, 2022: 5513-5517.
[11] SKERRY-RYAN R J, BATTENBERG E, XIAO Y, et al. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron[C]//International Conference on Machine Learning. PMLR, 2018: 4693-4702.
[12] CHEN M, TAN X, LI B. AdaSpeech: Adaptive text to speech for custom voice[C]//9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
[13] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 4779-4783.
[14] HE M, DENG Y, HE L. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS[C]//20th Annual Conference of the International Speech Communication Association. Graz: ISCA, 2019: 1293-1297.
[15] ZHENG Y, LI X, XIE F, et al. Improving end-to-end speech synthesis with local recurrent neural network enhanced Transformer[C]//ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6734-6738.
[16] ZHAO W, HE T, XU L. Enhancing local dependencies for Transformer-based text-to-speech via hybrid lightweight convolution[J]. IEEE Access, 2021, 9: 42762-42770.
[17] LIU Y, XUE R, HE L, et al. DelightfulTTS 2: End-to-end speech synthesis with adversarial vector-quantized auto-encoders[C]//23rd Annual Conference of the International Speech Communication Association. Incheon: ISCA, 2022: 1581-1585.
[18] MORIOKA N, ZEN H, CHEN N, et al. Residual adapters for few-shot text-to-speech speaker adaptation[J]. arXiv:2210.15868, 2022.
[19] LEI S, ZHOU Y, CHEN L, et al. Towards expressive speaking style modelling with hierarchical context information for Mandarin speech synthesis[C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022). IEEE, 2022: 7922-7926.
[20] WANG Y, STANTON D, ZHANG Y, et al. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis[C]//International Conference on Machine Learning. PMLR, 2018: 5180-5189.
[21] STANTON D, WANG Y, SKERRY-RYAN R J. Predicting expressive speaking style from text in end-to-end speech synthesis[C]//2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018: 595-602.
[22] CHOI S, HAN S, KIM D, et al. Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding[C]//21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020: 2007-2011.
[23] ELIAS I, ZEN H, SHEN J, et al. Parallel Tacotron: Non-autoregressive and controllable TTS[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021). IEEE, 2021: 5709-5713.
[24] BAE J S, YANG J, BAK T J, et al. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech[C]//23rd Annual Conference of the International Speech Communication Association. Incheon: ISCA, 2022: 813-817.
[25] CHIEN C M, LEE H. Hierarchical prosody modeling for non-autoregressive speech synthesis[C]//2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 446-453.
[26] RAMACHANDRAN P, ZOPH B, LE Q V, et al. Searching for activation functions[C]//6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018.
[27] DAUPHIN Y N, FAN A, AULI M, et al. Language modeling with gated convolutional networks[C]//International Conference on Machine Learning. PMLR, 2017: 933-941.
[28] ZEN H, DANG V, CLARK R, et al. LibriTTS: A corpus derived from LibriSpeech for text-to-speech[C]//20th Annual Conference of the International Speech Communication Association. Graz: ISCA, 2019: 1526-1530.
[29] KONG J, KIM J, BAE J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis[J]. Advances in Neural Information Processing Systems, 2020, 33: 17022-17033.