Computer Science, 2021, Vol. 48, Issue (8): 200-208. doi: 10.11896/jsjkx.200500148
• Artificial Intelligence •
PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin