Computer Science, 2021, Vol. 48, Issue (8): 200-208. doi: 10.11896/jsjkx.200500148
• Artificial Intelligence •
PAN Xiao-qin, LU Tian-liang, DU Yan-hui, TONG Xin