Computer Science, 2024, Vol. 51, Issue (2): 161-171. doi: 10.11896/jsjkx.221100125
• Computer Graphics & Multimedia •
WU Kewei1,2,3, HAN Chao3, SUN Yongxuan1,2,3, PENG Menghao3, XIE Zhao1,2,3
References
[1] AN X, DAI Z B, LI Y, et al. An end-to-end speech synthesis method based on BERT[J]. Computer Science, 2022, 49(4): 221-226.
[2] LI N, LIU S, LIU Y, et al. Neural speech synthesis with Transformer network[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019: 6706-6713.
[3] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//NIPS. 2017: 5998-6008.
[4] YANG S, LU H, KANG S, et al. On the localness modeling for the self-attention based end-to-end speech synthesis[J]. Neural Networks, 2020, 125: 121-130.
[5] REN Y, RUAN Y, TAN X, et al. FastSpeech: Fast, robust and controllable text to speech[C]//NeurIPS. 2019: 3165-3174.
[6] REN Y, HU C, TAN X, et al. FastSpeech 2: Fast and high-quality end-to-end text to speech[C]//9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
[7] ŁAŃCUCKI A. FastPitch: Parallel text-to-speech with pitch prediction[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021). IEEE, 2021: 6588-6592.
[8] GULATI A, QIN J, CHIU C C, et al. Conformer: Convolution-augmented Transformer for speech recognition[C]//21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020: 5036-5040.
[9] LIU Y, XU Z, WANG G, et al. DelightfulTTS: The Microsoft speech synthesis system for Blizzard Challenge 2021[J]. arXiv:2110.12612, 2021.
[10] DAI Z, YU J, WANG Y, et al. Automatic prosody annotation with pre-trained text-speech model[C]//23rd Annual Conference of the International Speech Communication Association. Incheon: ISCA, 2022: 5513-5517.
[11] SKERRY-RYAN R J, BATTENBERG E, XIAO Y, et al. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron[C]//International Conference on Machine Learning. PMLR, 2018: 4693-4702.
[12] CHEN M, TAN X, LI B. AdaSpeech: Adaptive text to speech for custom voice[C]//9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
[13] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 4779-4783.
[14] HE M, DENG Y, HE L. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS[C]//20th Annual Conference of the International Speech Communication Association. Graz: ISCA, 2019: 1293-1297.
[15] ZHENG Y, LI X, XIE F, et al. Improving end-to-end speech synthesis with local recurrent neural network enhanced Transformer[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6734-6738.
[16] ZHAO W, HE T, XU L. Enhancing local dependencies for Transformer-based text-to-speech via hybrid lightweight convolution[J]. IEEE Access, 2021, 9: 42762-42770.
[17] LIU Y, XUE R, HE L, et al. DelightfulTTS 2: End-to-end speech synthesis with adversarial vector-quantized auto-encoders[C]//23rd Annual Conference of the International Speech Communication Association. Incheon: ISCA, 2022: 1581-1585.
[18] MORIOKA N, ZEN H, CHEN N, et al. Residual adapters for few-shot text-to-speech speaker adaptation[J]. arXiv:2210.15868, 2022.
[19] LEI S, ZHOU Y, CHEN L, et al. Towards expressive speaking style modelling with hierarchical context information for Mandarin speech synthesis[C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022). IEEE, 2022: 7922-7926.
[20] WANG Y, STANTON D, ZHANG Y, et al. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis[C]//International Conference on Machine Learning. PMLR, 2018: 5180-5189.
[21] STANTON D, WANG Y, SKERRY-RYAN R J. Predicting expressive speaking style from text in end-to-end speech synthesis[C]//2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018: 595-602.
[22] CHOI S, HAN S, KIM D, et al. Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding[C]//21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020: 2007-2011.
[23] ELIAS I, ZEN H, SHEN J, et al. Parallel Tacotron: Non-autoregressive and controllable TTS[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021). IEEE, 2021: 5709-5713.
[24] BAE J S, YANG J, BAK T J, et al. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech[C]//23rd Annual Conference of the International Speech Communication Association. Incheon: ISCA, 2022: 813-817.
[25] CHIEN C M, LEE H. Hierarchical prosody modeling for non-autoregressive speech synthesis[C]//2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 446-453.
[26] RAMACHANDRAN P, ZOPH B, LE Q V, et al. Searching for activation functions[C]//6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018.
[27] DAUPHIN Y N, FAN A, AULI M, et al. Language modeling with gated convolutional networks[C]//International Conference on Machine Learning. PMLR, 2017: 933-941.
[28] ZEN H, DANG V, CLARK R, et al. LibriTTS: A corpus derived from LibriSpeech for text-to-speech[C]//20th Annual Conference of the International Speech Communication Association. Graz: ISCA, 2019: 1526-1530.
[29] KONG J, KIM J, BAE J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis[J]. Advances in Neural Information Processing Systems, 2020, 33: 17022-17033.
Related Articles
[1] YANG Lin, YANG Jian, CAI Haoran, LIU Cong. Vietnamese Speech Synthesis Based on Transfer Learning[J]. Computer Science, 2023, 50(8): 118-124.
[2] CAI Haoran, YANG Jian, YANG Lin, LIU Cong. Low-resource Thai Speech Synthesis Based on Alternate Training and Pre-training[J]. Computer Science, 2023, 50(6A): 220800127-5.
[3] HU Yu-jiao, JIA Qing-min, SUN Qing-shuang, XIE Ren-chao, HUANG Tao. Functional Architecture to Intelligent Computing Power Network[J]. Computer Science, 2022, 49(9): 249-259.
[4] AN Xin, DAI Zi-biao, LI Yang, SUN Xiao, REN Fu-ji. End-to-End Speech Synthesis Based on BERT[J]. Computer Science, 2022, 49(4): 221-226.
[5] PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin. Overview of Speech Synthesis and Voice Conversion Technology Based on Deep Learning[J]. Computer Science, 2021, 48(8): 200-208.
[6] TANG Hao-feng, DONG Yuan-fang, ZHANG Yi-tong, SUN Juan-juan. Survey of Image Inpainting Algorithms Based on Deep Learning[J]. Computer Science, 2020, 47(11A): 151-164.
[7] ZHAO Jiao-jiao, MA Wen-ping, LUO Wei, LIU Xiao-xue. Hierarchical Hybrid Authentication Model Based on Key Sharing[J]. Computer Science, 2019, 46(2): 115-119.
[8] DONG Jian-kang, TANG Chao, GENG Hong. Correlation-Hierarchy Based Virtual Maintenance Modeling Method for Complex Electromechanical Components of Aircraft[J]. Computer Science, 2018, 45(12): 192-195.
[9] WU Zhong-zhi. Research on Hierarchical Modeling Technology of Typical System Based on Architecture[J]. Computer Science, 2018, 45(11A): 542-544.
[10] JIA Xi-bin, YIN Bao-cai, SUN Yan-fen. Bi-level Codebook Based Speech-driven Visual-speech Synthesis System[J]. Computer Science, 2014, 41(1): 100-104.
[11] ZHAO Jian-dong, GAO Guang-lai, BAO Fei-long. Research on HMM-based Mongolian Speech Synthesis[J]. Computer Science, 2014, 41(1): 80-82.
[12] YANG Pei, TAN Qi, DING Yue-hua. Non-linear Transfer Learning Model[J]. Computer Science, 2009, 36(8): 212-214.
[13] ZHU Qing-sheng, ZHANG Min, LIU Feng (Computer College, Chongqing University, Chongqing 400030). [J]. Computer Science, 2008, 35(4): 231-232.