Computer Science ›› 2021, Vol. 48 ›› Issue (8): 200-208. doi: 10.11896/jsjkx.200500148
潘孝勤, 芦天亮, 杜彦辉, 仝鑫
PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin
Abstract: Driven by deep learning, speech information processing technology has developed rapidly. In particular, combining speech synthesis with voice conversion enables real-time, high-fidelity speech output with a specified speaker identity and content, which has broad application prospects in human-computer interaction, pan-entertainment, and other fields. This paper surveys deep learning-based speech synthesis and voice conversion. First, it briefly reviews the development of speech synthesis and voice conversion technology. Next, it lists the public datasets commonly used in these two areas so that researchers can carry out related work. Then, it discusses text-to-speech models, covering classic and state-of-the-art models and algorithms that improve style, prosody, speed, and other aspects, and compares their performance and development potential. It further reviews voice conversion, summarizing conversion methods and optimization strategies. Finally, it summarizes the applications and challenges of speech synthesis and voice conversion and, in light of the problems they face in terms of models, applications, and regulation, looks ahead to future directions in model compression, few-shot learning, and forgery detection.
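To make the text-to-speech pipeline discussed in the abstract concrete, the following is a minimal, illustrative sketch (not taken from the paper) of the common two-stage design: an acoustic model maps character IDs to a mel-spectrogram, and a vocoder maps the spectrogram to a waveform. All module names, layer sizes, and the hop length are hypothetical placeholders chosen only for the example.

# Minimal sketch of a two-stage neural TTS pipeline (acoustic model + vocoder).
# Everything here is a toy placeholder, not the implementation of any surveyed system.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Text encoder plus a simple decoder that emits mel-spectrogram frames."""
    def __init__(self, vocab_size=64, emb_dim=128, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, 128, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(256, 256, batch_first=True)
        self.mel_out = nn.Linear(256, mel_dim)

    def forward(self, char_ids):
        enc, _ = self.encoder(self.embed(char_ids))   # (B, T_text, 256)
        dec, _ = self.decoder(enc)                    # toy: no attention, one frame per character
        return self.mel_out(dec)                      # (B, T_text, 80) mel frames

class ToyVocoder(nn.Module):
    """Expands each mel frame into waveform samples (stands in for a neural vocoder)."""
    def __init__(self, mel_dim=80, hop=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mel_dim, 256), nn.Tanh(), nn.Linear(256, hop))

    def forward(self, mel):
        return self.net(mel).flatten(1)               # (B, T_text * hop) samples

if __name__ == "__main__":
    text = torch.randint(0, 64, (1, 20))              # 20 dummy character IDs
    mel = ToyAcousticModel()(text)
    wav = ToyVocoder()(mel)
    print(mel.shape, wav.shape)                       # torch.Size([1, 20, 80]) torch.Size([1, 5120])

In the systems the survey covers, the toy decoder would be replaced by an attention-based or duration-based sequence-to-sequence model, and the toy vocoder by an autoregressive or flow-based waveform generator; voice conversion systems follow a similar spectrogram-in, waveform-out structure but condition on a target speaker representation instead of text.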