Computer Science ›› 2021, Vol. 48 ›› Issue (8): 200-208. doi: 10.11896/jsjkx.200500148

• Artificial Intelligence •

Overview of Speech Synthesis and Voice Conversion Technology Based on Deep Learning

PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin   

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2020-05-27 Revised: 2020-12-03 Published: 2021-08-10
  • Corresponding author: LU Tian-liang (lutianliang@ppsuc.edu.cn)
  • About author: PAN Xiao-qin, born in 1997, postgraduate. Her main research interests include cyber security and artificial intelligence. (m18811328909@163.com) LU Tian-liang, born in 1985, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include cyber security and artificial intelligence.
  • Supported by:
    National Key R&D Program of China(2017YFB0802804) and Fundamental Research Funds for the Central Universities of PPSUC(2020JKF101).

Abstract: Voice information processing technology is developing rapidly under the impetus of deep learning. Combining speech synthesis with voice conversion makes it possible to produce real-time, high-fidelity speech with a designated speaker identity and content, which has broad application prospects in human-computer interaction, pan-entertainment and other fields. This paper surveys speech synthesis and voice conversion technology based on deep learning. It first briefly reviews the development of the two technologies, then lists the common public datasets in these fields to facilitate related research. Next, it discusses text-to-speech (TTS) models, covering classic and cutting-edge models and algorithms that improve style, prosody and speed, and compares their performance and development potential. It then reviews voice conversion, summarizing conversion methods and optimization ideas. Finally, it discusses the applications and challenges of speech synthesis and voice conversion and, in light of the problems they face in terms of models, applications and regulation, outlines future directions in model compression, few-shot learning and forgery detection.
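
The abstract above describes a two-stage neural TTS pipeline: an acoustic model (e.g., Tacotron [9]) maps text to a mel spectrogram, and a vocoder (e.g., WaveNet [20]) converts that spectrogram into a waveform. The following is a minimal PyTorch sketch of that structure under simplifying assumptions: the layer sizes are arbitrary, a plain GRU stands in for attention-based autoregressive decoding, and the transposed-convolution vocoder is a toy upsampler, not the architecture of any system cited here.

import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    # Text-to-mel stage, greatly simplified: real systems decode
    # autoregressively with attention and predict a stop token.
    def __init__(self, vocab_size=64, emb_dim=128, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True,
                              bidirectional=True)
        self.decoder = nn.GRU(2 * emb_dim, 2 * emb_dim, batch_first=True)
        self.to_mel = nn.Linear(2 * emb_dim, mel_dim)

    def forward(self, text_ids):
        x = self.embed(text_ids)    # (batch, text_len, emb_dim)
        enc, _ = self.encoder(x)    # (batch, text_len, 2*emb_dim)
        dec, _ = self.decoder(enc)  # stand-in for attention decoding
        return self.to_mel(dec)     # (batch, text_len, mel_dim)

class Vocoder(nn.Module):
    # Mel-to-waveform stage reduced to a toy upsampling stack; neural
    # vocoders such as WaveNet instead model the waveform sample by sample.
    def __init__(self, mel_dim=80, hop=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(mel_dim, 32, kernel_size=hop, stride=hop),
            nn.Tanh(),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),
        )

    def forward(self, mel):
        # mel: (batch, frames, mel_dim) -> waveform: (batch, frames * hop)
        return self.net(mel.transpose(1, 2)).squeeze(1)

text_ids = torch.randint(0, 64, (1, 20))    # dummy 20-symbol utterance
wav = Vocoder()(AcousticModel()(text_ids))
print(wav.shape)                            # torch.Size([1, 5120])

Voice conversion fits the same picture: instead of starting from text, it transforms the spectral features of a source utterance toward a target speaker and then vocodes the result, which is why the two technologies combine naturally into the designated-speaker, designated-content pipeline discussed above.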

Key words: Deep learning, Generative adversarial networks, Speech synthesis, Voice conversion, Voice information processing

CLC Number: TP301

References
[1]RealTalk[OL].https://medium.com/dessa-news/real-talk-speechsynthesis-5dd0897eef7f.
[2]MelNet[OL].https://sjvasquez.github.io/blog/melnet/.
[3]MÖBIUS B,SPROAT R,SANTEN J,et al.The Bell Labs German text-to-speech system:an overview[C]//Fifth European Conference on Speech Communication and Technology.1997:22-25.
[4]WU Y J,WANG R H.Minimum Generation Error Training for HMM-Based Speech Synthesis[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2006:89-92.
[5]ZEN H,BRAUNSCHWEILER N.Context-dependent additive log F0 model for HMM-based speech synthesis[C]//Conference of the International Speech Communication Association.2009:2091-2094.
[6]TODA T,SARUWATARI H,SHIKANO K.Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2001:841-844.
[7]AIHARA R,TAKASHIMA R,TAKIGUCHI T,et al.GMM-Based Emotional Voice Conversion Using Spectrum and Prosody Features[J].American Journal of Signal Processing,2012,2(5):134-138.
[8]ARIK S O,CHRZANOWSKI M,COATES A,et al.Deep Voice:Real-time Neural Text-to-Speech[J].arXiv:1702.07825,2017.
[9]WANG Y,SKERRY-RYAN R J,STANTON D,et al.Tacotron:Towards End-to-End Speech Synthesis[C]//Conference of the International Speech Communication Association.2017:4006-4010.
[10]GOODFELLOW I J,POUGET-ABADIE J,MIRZA M,et al.Generative Adversarial Networks[J].Advances in Neural Information Processing Systems,2014,3:2672-2680.
[11]LEMMETTY S.Review of Speech Synthesis Technology[D].Helsinki University of Technology,1999.
[12]ZEN H,SENIOR A W,SCHUSTER M,et al.Statistical parametric speech synthesis using deep neural networks[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2013:7962-7966.
[13]LU H,KING S,WATTS O.Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis[C]//The 8th ISCA Speech Synthesis Workshop.2013:261-265.
[14]WU Z,TAKAKI S,YAMAGISHI J.Deep Denoising Auto-encoder for Statistical Speech Synthesis[J].arXiv:1506.05268,2015.
[15]KANG S,QIAN X,MENG H.Multi-distribution deep belief network for speech synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2013:8012-8016.
[16]YIN X,LING Z H,HU Y J.Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis[J].IEEE Transactions on Audio,Speech,and Language Processing,2013,21(10):2129-2139.
[17]FERNANDEZ R,RENDEL A,RAMABHADRAN B,et al.Prosody Contour Prediction with Long Short-Term Memory,Bi-Directional,Deep Recurrent Neural Networks[C]//Conference of the International Speech Communication Association.2014:2268-2272.
[18]FAN Y,QIAN Y,XIE F,et al.TTS synthesis with bidirectional LSTM based recurrent neural networks[C]//Conference of the International Speech Communication Association.2014:1964-1968.
[19]DING C,XIE L,YAN J,et al.Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features[C]//2015 IEEE Workshop on Automatic Speech Recognition and Understanding.IEEE,2015:98-102.
[20]OORD A V D,DIELEMAN S,ZEN H,et al.WaveNet:A Generative Model for Raw Audio[J].arXiv:1609.03499,2016.
[21]MEHRI S,KUMAR K,GULRAJANI I,et al.SampleRNN:An Unconditional End-to-End Neural Audio Generation Model[J].arXiv:1612.07837,2016.
[22]KALCHBRENNER N,ELSEN E,SIMONYAN K,et al.Efficient neural audio synthesis[J].arXiv:1802.08435,2018.
[23]OORD A V D,LI Y,BABUSCHKIN I,et al.Parallel WaveNet:Fast High-Fidelity Speech Synthesis[J].arXiv:1711.10433,2017.
[24]PRENGER R,VALLE R,CATANZARO B.Waveglow:A flow-based generative network for speech synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2019:3617-3621.
[25]ZHAI B,GAO T,XUE F,et al.SqueezeWave:Extremely Lightweight Vocoders for On-device Speech Synthesis[J].arXiv:2001.05685,2020.
[26]ARIK S O,DIAMOS G,GIBIANSKY A,et al.Deep Voice 2:Multi-Speaker Neural Text-to-Speech[C]//Advances in Neural Information Processing Systems.Curran Associates,2017:2962-2970.
[27]PING W,PENG K,GIBIANSKY A,et al.Deep Voice 3:Scaling Text-to-Speech with Convolutional Sequence Learning[J].arXiv:1710.07654,2017.
[28]SHEN J,PANG R,WEISS R,et al.Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions[C]//International Conference on Acoustics,Speech,and Signal Processing.IEEE,2018:4779-4783.
[29]SOTELO J,MEHRI S,KUMAR K,et al.Char2Wav:End-to-End Speech Synthesis.ICLR 2017 Workshop Submission[EB/OL].(2017-04-16)[2020-05-26].https://openreview.net/forum?id=B1VWyySKx.
[30]LIU P,WU X,KANG S,et al.Maximizing Mutual Information for Tacotron[J].arXiv:1909.01145,2019.
[31]MING H,HE L,GUO H,et al.Feature reinforcement with word embedding and parsing information in neural TTS[J].arXiv:1901.00707,2019.
[32]WANG Y,STANTON D,ZHANG Y,et al.Style Tokens:Unsupervised Style Modeling,Control and Transfer in End-to-End Speech Synthesis[C]//Proceedings of the 35th International Conference on Machine Learning.PMLR,2018:5180-5189.
[33]LEE Y,KIM T.Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2019:5911-5915.
[34]ZHANG Y,PAN S,HE L,et al.Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2019:6945-6949.
[35]AGGARWAL V,COTESCU M,PRATEEK N,et al.Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech[J].arXiv:1911.12760,2019.
[36]HU T Y,SHRIVASTAVA A,TUZEL O,et al.Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:3267-3271.
[37]SUN G,ZHANG Y,WEISS R J,et al.Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:6264-6268.
[38]MA S,MCDUFF D,SONG Y,et al.Neural TTS Stylization with Adversarial and Collaborative Games.ICLR 2019 Conference Blind Submission[EB/OL].(2019-02-23)[2020-05-26].https://openreview.net/pdf?id=ByzcS3AcYX.
[39]TACHIBANA H,UENOYAMA K,AIHARA S.Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2018:4784-4788.
[40]PING W,PENG K,CHEN J,et al.ClariNet:Parallel Wave Generation in End-to-End Text-to-Speech[J].arXiv:1807.07281,2018.
[41]PARK J,ZHAO K,PENG K,et al.Multi-Speaker End-to-End Speech Synthesis[J].arXiv:1907.04462,2019.
[42]REN Y,RUAN Y,TAN X,et al.FastSpeech:Fast,Robust and Controllable Text to Speech[C]//Advances in Neural Information Processing Systems.2019:3171-3180.
[43]BINKOWSKI M,DONAHUE J,DIELEMAN S,et al.High Fidelity Speech Synthesis with Adversarial Networks[J].arXiv:1909.11646,2019.
[44]MOSS H B,AGGARWAL V,PRATEEK N,et al.BOFFIN TTS:Few-Shot Speaker Adaptation by Bayesian Optimization[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:7639-7643.
[45]WU Z,CHNG E S,LI H,et al.Conditional restricted Boltzmann machine for voice conversion[C]//International Conference on Signal and Information Processing.IEEE,2013:104-108.
[46]NAKASHIKA T,TAKIGUCHI T,ARIKI Y.Voice conversion in time-invariant speaker-independent space[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2014:7889-7893.
[47]JIAO Y,XIE X,NA X,et al.Improving voice quality of HMM-based speech synthesis using voice conversion method[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2014:7914-7918.
[48]KANEKO T,KAMEOKA H.Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks[J].arXiv:1711.11293,2017.
[49]KANEKO T,KAMEOKA H,TANAKA K,et al.CycleGAN-VC2:Improved CycleGAN-based Non-parallel Voice Conversion[C]//International Conference on Acoustics Speech and Signal Processing.IEEE,2019:6820-6824.
[50]ISOLA P,ZHU J Y,ZHOU T,et al.Image-to-image translation with conditional adversarial networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition.2017:1125-1134.
[51]KAMEOKA H,KANEKO T,TANAKA K,et al.StarGAN-VC:Non-parallel many-to-many voice conversion with star generative adversarial networks[C]//Spoken Language Technology Workshop.IEEE,2018:266-273.
[52]KANEKO T,KAMEOKA H,TANAKA K,et al.StarGAN-VC2:Rethinking Conditional Methods for StarGAN-Based Voice Conversion[C]//Conference of the International Speech Communication Association.2019:679-683.
[53]HSU C C,HWANG H T,WU Y C,et al.Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder[C]//Asia Pacific Signal and Information Processing Association Annual Summit and Conference.IEEE,2016:1-6.
[54]CHOU J C,LEE H Y.One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization[C]//Conference of the International Speech Communication Association.2019:664-668.
[55]QIAN K,ZHANG Y,CHANG S,et al.AUTOVC:Zero-Shot Voice Style Transfer with Only Autoencoder Loss[C]//Proceedings of the 36th International Conference on Machine Learning.PMLR,2019:5210-5219.
[56]QIAN K,JIN Z,HASEGAWA-JOHNSON M,et al.F0-consistent Many-to-many Non-parallel Voice Conversion via Conditional Autoencoder[C]//International Conference on Acoustics,Speech and Signal Processing.IEEE,2020:6284-6288.
[57]JUNG S,SUH Y,CHOI Y,et al.Non-parallel Voice Conversion Based on Source-to-target Direct Mapping[J].arXiv:2006.06937,2020.
[58]POLYAK A,WOLF L,TAIGMAN Y.TTS Skins:Speaker Conversion via ASR[J].arXiv:1904.08983,2019.
[59]REBRYK Y,BELIAEV S.ConVoice:Real-Time Zero-Shot Voice Style Transfer with Convolutional Network[J].arXiv:2005.07815,2020.
[60]TAO J H,FU R B,YI J Y,et al.Development and Challenge of Speech Forgery and Detection[J].Journal of Cyber Security,2020,5(2):28-38.