Computer Science ›› 2021, Vol. 48 ›› Issue (8): 200-208. doi: 10.11896/jsjkx.200500148

• Artificial Intelligence •


Overview of Speech Synthesis and Voice Conversion Technology Based on Deep Learning

PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin   

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  • Received:2020-05-27 Revised:2020-12-03 Published:2021-08-10
  • Corresponding author: LU Tian-liang (lutianliang@ppsuc.edu.cn)
  • About author: PAN Xiao-qin, born in 1997, postgraduate. Her main research interests include cyber security and artificial intelligence. (m18811328909@163.com)
    LU Tian-liang, born in 1985, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include cyber security and artificial intelligence.
  • Supported by:
    National Key R&D Program of China(2017YFB0802804) and Fundamental Research Funds for the Central Universities of PPSUC(2020JKF101).


Abstract: Voice information processing technology is developing rapidly under the impetus of deep learning. The combination of speech synthesis and voice conversion can achieve real-time, high-fidelity voice output with a designated speaker identity and content, and has broad application prospects in human-computer interaction, entertainment and other fields. This paper provides an overview of speech synthesis and voice conversion technology based on deep learning. First, it briefly reviews the development of speech synthesis and voice conversion. Next, it enumerates the common public datasets in these fields so that researchers can conveniently carry out related explorations. It then discusses text-to-speech (TTS) models, covering the classic and cutting-edge models and algorithms that improve style, prosody and speed, and compares their effects and development potential. It further reviews voice conversion, summarizing conversion methods and optimization ideas. Finally, it summarizes the applications and challenges of speech synthesis and voice conversion and, in light of the problems they face in terms of models, applications and regulation, looks forward to future development directions in model compression, few-shot learning and forgery detection.
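To make concrete the two-stage pipeline that dominates the surveyed work — an acoustic model such as Tacotron 2 [28] predicting mel spectrograms from text, followed by a neural vocoder such as WaveNet [20] or WaveGlow [24] rendering them to waveform — the following PyTorch sketch wires toy stand-ins for both stages. It is a minimal illustration only, not the implementation of any surveyed system: every module name, layer and dimension (ToyAcousticModel, ToyVocoder, the hop size of 256) is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Toy stand-in for a text-to-spectrogram model (cf. Tacotron 2 [28])."""
    def __init__(self, vocab_size=64, emb_dim=32, hidden=64, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, text_ids):
        x = self.embed(text_ids)   # (batch, chars, emb_dim)
        h, _ = self.rnn(x)         # (batch, chars, hidden)
        return self.proj(h)        # (batch, chars, n_mels): mel-like frames

class ToyVocoder(nn.Module):
    """Toy stand-in for a neural vocoder (cf. WaveNet [20], WaveGlow [24]):
    expands each mel frame into `hop` waveform samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.to_samples = nn.Linear(n_mels, hop)

    def forward(self, mel):
        frames = self.to_samples(mel)            # (batch, chars, hop)
        return frames.reshape(mel.size(0), -1)   # (batch, chars * hop)

text_ids = torch.randint(0, 64, (1, 20))  # a dummy 20-character "sentence"
mel = ToyAcousticModel()(text_ids)        # stage 1: text -> mel spectrogram
wav = ToyVocoder()(mel)                   # stage 2: mel -> raw waveform
print(mel.shape, wav.shape)               # (1, 20, 80) and (1, 5120)
```

In real systems the acoustic model is where speaker identity, style and prosody are injected (e.g., via style tokens [32] or speaker embeddings [26]), which is also the natural hook for combining synthesis with voice conversion.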

Key words: Voice information processing, Speech synthesis, Voice conversion, Deep learning, Generative adversarial networks
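Among the conversion approaches reviewed, GAN-based models such as CycleGAN-VC [48] and CycleGAN-VC2 [49] remove the need for parallel (time-aligned) training corpora through a cycle-consistency constraint: mapping source-speaker features to the target speaker and back must reconstruct the input. Below is a minimal sketch of that loss term alone, assuming hypothetical linear generators as stand-ins for the real deep networks; the adversarial and identity losses used in practice are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two generators of a CycleGAN-style
# converter between speakers X and Y; real systems use deep networks.
G_xy = nn.Linear(80, 80)  # maps 80-dim mel features of speaker X toward Y
G_yx = nn.Linear(80, 80)  # maps them back from Y toward X
l1 = nn.L1Loss()

x = torch.randn(8, 80)    # a batch of source-speaker mel features
# Cycle consistency: X -> Y -> X must reconstruct the input, so the
# generators can be trained without utterance-aligned X/Y pairs.
loss_cycle = l1(G_yx(G_xy(x)), x)
loss_cycle.backward()     # in practice combined with adversarial losses
```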

CLC Number: TP301
[1]RealTalk[OL].https://medium.com/dessa-news/real-talk-speechsynthesis-5dd0897eef7f.
[2]MelNet[OL].https://sjvasquez.github.io/blog/melnet/.
[3]MÖBIUS B,SPROAT R,SANTEN J,et al.The Bell Labs German text-to-speech system:an overview[C]//Fifth European Conference on Speech Communication and Technology.1997:22-25.
[4]WU Y J,WANG R H.Minimum Generation Error Training for HMM-Based Speech Synthesis[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2006:89-92.
[5]ZEN H,BRAUNSCHWEILER N.Context-dependent additive log F0 model for HMM-based speech synthesis[C]//Conference of the International Speech Communication Association.2009:2091-2094.
[6]TODA T,SARUWATARI H,SHIKANO K.Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2001:841-844.
[7]AIHARA R,TAKASHIMA R,TAKIGUCHI T,et al.GMM-Based Emotional Voice Conversion Using Spectrum and Prosody Features[J].American Journal of Signal Processing,2012,2(5):134-138.
[8]ARIK S O,CHRZANOWSKI M,COATES A,et al.Deep Voice:Real-time Neural Text-to-Speech[J].arXiv:1702.07825,2017.
[9]WANG Y,SKERRY-RYAN R J,STANTON D,et al.Tacotron:Towards End-to-End Speech Synthesis[C]//Conference of the International Speech Communication Association.2017:4006-4010.
[10]GOODFELLOW I J,POUGET-ABADIE J,MIRZA M,et al.Generative Adversarial Networks[J].Advances in Neural Information Processing Systems,2014,3:2672-2680.
[11]LEMMETTY S.Review of Speech Synthesis Technology[D].Helsinki University of Technology,1999.
[12]ZEN H,SENIOR A W,SCHUSTER M,et al.Statistical parametric speech synthesis using deep neural networks[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2013:7962-7966.
[13]LU H,KING S,WATTS O.Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis[C]//The 8th ISCA Speech Synthesis Workshop.2013:261-265.
[14]WU Z,TAKAKI S,YAMAGISHI J.Deep Denoising Auto-encoder for Statistical Speech Synthesis[J].arXiv:1506.05268,2015.
[15]KANG S,QIAN X,MENG H.Multi-distribution deep belief network for speech synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2013:8012-8016.
[16]YIN X,LING Z H,HU Y J.Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis[J].IEEE Transactions on Audio,Speech,and Language Processing,2013,21(10):2129-2139.
[17]FERNANDEZ R,RENDEL A,RAMABHADRAN B,et al.Prosody Contour Prediction with Long Short-Term Memory,Bi-Directional,Deep Recurrent Neural Networks[C]//Conference of the International Speech Communication Association.2014:2268-2272.
[18]FAN Y,QIAN Y,XIE F,et al.TTS synthesis with bidirectional LSTM based recurrent neural networks[C]//Conference of the International Speech Communication Association.2014:1964-1968.
[19]DING C,XIE L,YAN J,et al.Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features[C]//2015 IEEE Workshop on Automatic Speech Recognition and Understanding.IEEE,2015:98-102.
[20]OORD A V D,DIELEMAN S,ZEN H,et al.WaveNet:A Generative Model for Raw Audio[J].arXiv:1609.03499,2016.
[21]MEHRI S,KUMAR K,GULRAJANI I,et al.SampleRNN:An Unconditional End-to-End Neural Audio Generation Model[J].arXiv:1612.07837,2016.
[22]KALCHBRENNER N,ELSEN E,SIMONYAN K,et al.Efficient neural audio synthesis[J].arXiv:1802.08435,2018.
[23]OORD A V D,LI Y,BABUSCHKIN I,et al.Parallel WaveNet:Fast High-Fidelity Speech Synthesis[J].arXiv:1711.10433,2017.
[24]PRENGER R,VALLE R,CATANZARO B.Waveglow:A flow-based generative network for speech synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2019:3617-3621.
[25]ZHAI B,GAO T,XUE F,et al.SqueezeWave:Extremely Lightweight Vocoders for On-device Speech Synthesis[J].arXiv:2001.05685,2020.
[26]ARIK S O,DIAMOS G,GIBIANSKY A,et al.Deep Voice 2:Multi-Speaker Neural Text-to-Speech[C]//Advances in Neural Information Processing Systems.Curran Associates,2017:2962-2970.
[27]PING W,PENG K,GIBIANSKY A,et al.Deep Voice 3:Scaling Text-to-Speech with Convolutional Sequence Learning[J].arXiv:1710.07654,2017.
[28]SHEN J,PANG R,WEISS R,et al.Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2018:4779-4783.
[29]SOTELO J,MEHRI S,KUMAR K,et al.Char2Wav:End-to-End Speech Synthesis.ICLR 2017 Workshop Submission[EB/OL].(2017-04-16)[2020-05-26].https://openreview.net/forum?id=B1VWyySKx.
[30]LIU P,WU X,KANG S,et al.Maximizing Mutual Information for Tacotron[J].arXiv:1909.01145,2019.
[31]MING H,HE L,GUO H,et al.Feature reinforcement with word embedding and parsing information in neural TTS[J].arXiv:1901.00707,2019.
[32]WANG Y,STANTON D,ZHANG Y,et al.Style Tokens:Unsupervised Style Modeling,Control and Transfer in End-to-End Speech Synthesis[C]//Proceedings of the 35th International Conference on Machine Learning.PMLR,2018:5180-5189.
[33]LEE Y,KIM T.Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2019:5911-5915.
[34]ZHANG Y,PAN S,HE L,et al.Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2019:6945-6949.
[35]AGGARWAL V,COTESCU M,PRATEEK N,et al.Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech[J].arXiv:1911.12760,2019.
[36]HU T Y,SHRIVASTAVA A,TUZEL O,et al.Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:3267-3271.
[37]SUN G,ZHANG Y,WEISS R J,et al.Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:6264-6268.
[38]MA S,MCDUFF D,SONG Y,et al.Neural TTS Stylization with Adversarial and Collaborative Games.ICLR 2019 Conference Blind Submission[EB/OL].(2019-02-23)[2020-05-26].https://openreview.net/pdf?id=ByzcS3AcYX.
[39]TACHIBANA H,UENOYAMA K,AIHARA S.Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2018:4784-4788.
[40]PING W,PENG K,CHEN J,et al.ClariNet:Parallel Wave Generation in End-to-End Text-to-Speech[J].arXiv:1807.07281,2018.
[41]PARK J,ZHAO K,PENG K,et al.Multi-Speaker End-to-End Speech Synthesis[J].arXiv:1907.04462,2019.
[42]REN Y,RUAN Y,TAN X,et al.FastSpeech:Fast,Robust and Controllable Text to Speech[C]//Advances in Neural Information Processing Systems.2019:3171-3180.
[43]BINKOWSKI M,DONAHUE J,DIELEMAN S,et al.High Fidelity Speech Synthesis with Adversarial Networks[J].arXiv:1909.11646,2019.
[44]MOSS H B,AGGARWAL V,PRATEEK N,et al.BOFFIN TTS:Few-Shot Speaker Adaptation by Bayesian Optimization[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:7639-7643.
[45]WU Z,CHNG E S,LI H,et al.Conditional restricted Boltzmann machine for voice conversion[C]//International Conference on Signal and Information Processing.IEEE,2013:104-108.
[46]NAKASHIKA T,TAKIGUCHI T,ARIKI Y.Voice conversion in time-invariant speaker-independent space[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2014:7889-7893.
[47]JIAO Y,XIE X,NA X,et al.Improving voice quality of HMM-based speech synthesis using voice conversion method[C]//International Conference on Acoustics Speech and Signal Processing.IEEE,2014:7914-7918.
[48]KANEKO T,KAMEOKA H.Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks[J].arXiv:1711.11293,2017.
[49]KANEKO T,KAMEOKA H,TANAKA K,et al.CycleGAN-VC2:Improved CycleGAN-based Non-parallel Voice Conversion[C]//International Conference on Acoustics Speech and Signal Processing.IEEE,2019:6820-6824.
[50]ISOLA P,ZHU J Y,ZHOU T,et al.Image-to-image translation with conditional adversarial networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition.2017:1125-1134.
[51]KAMEOKA H,KANEKO T,TANAKA K,et al.StarGAN-VC:Non-parallel many-to-many voice conversion with star generative adversarial networks[C]//Spoken Language Technology Workshop.IEEE,2018:266-273.
[52]KANEKO T,KAMEOKA H,TANAKA K,et al.StarGAN-VC2:Rethinking Conditional Methods for StarGAN-Based Voice Conversion[C]//Conference of the International Speech Communication Association.2019:679-683.
[53]HSU C C,HWANG H T,WU Y C,et al.Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder[C]//Asia Pacific Signal and Information Processing Association Annual Summit and Conference.IEEE,2016:1-6.
[54]CHOU J C,LEE H Y.One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization[C]//Conference of the International Speech Communication Association.2019:664-668.
[55]QIAN K,ZHANG Y,CHANG S,et al.AUTOVC:Zero-Shot Voice Style Transfer with Only Autoencoder Loss[C]//Proceedings of the 36th International Conference on Machine Learning.PMLR,2019:5210-5219.
[56]QIAN K,JIN Z,HASEGAWA-JOHNSON M,et al.F0-consistent Many-to-many Non-parallel Voice Conversion via Conditional Autoencoder[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:6284-6288.
[57]JUNG S,SUH Y,CHOI Y,et al.Non-parallel Voice Conversion Based on Source-to-target Direct Mapping[J].arXiv:2006.06937,2020.
[58]POLYAK A,WOLF L,TAIGMAN Y.TTS Skins:Speaker Conversion via ASR[J].arXiv:1904.08983,2019.
[59]REBRYK Y,BELIAEV S.ConVoice:Real-Time Zero-Shot Voice Style Transfer with Convolutional Network[J].arXiv:2005.07815,2020.
[60]TAO J H,FU R B,YI J Y,et al.Development and Challenge of Speech Forgery and Detection[J].Journal of Cyber Security,2020,5(2):28-38.