Computer Science ›› 2022, Vol. 49 ›› Issue (6A): 301-308. doi: 10.11896/jsjkx.210300134

• Image Processing & Multimedia Technology •


Research Progress on Speech Style Transfer

LIU Chang, WEI Wei-min, MENG Fan-xing, CAI Zhi   

  1. School of Computer Science and Technology,Shanghai University of Electric Power,Shanghai 200000,China
  • Online:2022-06-10 Published:2022-06-08
  • Corresponding author: WEI Wei-min(wwm@shiep.edu.cn)
  • About author:LIU Chang,born in 1997,postgraduate(lc8025@mail.shiep.edu.cn).Her main research interests include speech spoofing detection.
    WEI Wei-min,born in 1970,Ph.D,assistant professor.His main research interests include information security,image processing,digital forensics,information hiding and machine learning.
  • Supported by:
    Natural Science Foundation of Shanghai,China(16ZR1413100).


Abstract: Speech style transfer technology refers to converting the timbre or speech style of a source speaker into that of a target speaker without changing the speech content. Driven by the urgent need for privacy protection on social media and the rapid development of neural-network-based tampering techniques, speech style transfer has been studied extensively. Building on the basic principle of speech style transfer, this paper analyzes the state of research from the perspective of three key factors: the vocoder, corpus alignment, and the transfer model, covering traditional vocoders versus the WaveNet vocoder, parallel versus non-parallel corpora, and conventional transfer models versus neural network models. It then summarizes the current problems and challenges of speech style transfer technology and discusses future research directions.

Key words: Corpus alignment, Neural network, Speech, Style transfer, Vocoder
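
To make the analysis-conversion-synthesis pipeline summarized in the abstract concrete, the following minimal Python sketch runs a toy voice conversion with the WORLD vocoder (via the pyworld package): speech is decomposed into F0, spectral envelope and aperiodicity, the F0 trajectory is mapped with a log-Gaussian normalized transform (a standard prosody baseline), and the waveform is resynthesized. The file names and the per-speaker log-F0 statistics (SRC_MEAN, SRC_STD, TGT_MEAN, TGT_STD) are hypothetical placeholders that would normally be estimated from training corpora; a complete system would also convert the spectral envelope with, e.g., a GMM or neural network model of the kind surveyed in this paper.

import numpy as np
import pyworld as pw        # WORLD vocoder bindings
import soundfile as sf

# Hypothetical per-speaker statistics of log F0 over voiced frames,
# normally estimated from source- and target-speaker training data.
SRC_MEAN, SRC_STD = 4.8, 0.25
TGT_MEAN, TGT_STD = 5.3, 0.20

def convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    # Log-Gaussian normalized F0 transform; unvoiced frames (f0 == 0) are kept as-is.
    f0_conv = f0.copy()
    voiced = f0 > 0
    f0_conv[voiced] = np.exp((np.log(f0[voiced]) - src_mean) / src_std * tgt_std + tgt_mean)
    return f0_conv

# 1) Analysis: decompose mono speech into F0, spectral envelope and aperiodicity.
x, fs = sf.read("source.wav")                     # hypothetical mono input file
f0, sp, ap = pw.wav2world(x.astype(np.float64), fs)

# 2) Conversion: only prosody (F0) is mapped in this toy example.
f0_conv = convert_f0(f0, SRC_MEAN, SRC_STD, TGT_MEAN, TGT_STD)

# 3) Synthesis: the vocoder reconstructs a waveform from the modified features.
y = pw.synthesize(f0_conv, sp, ap, fs)
sf.write("converted.wav", y, fs)

Swapping step 3 for a neural vocoder such as WaveNet, conditioned on the converted features, is exactly the kind of substitution examined in the vocoder comparisons surveyed in this paper.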

CLC Number: TP391

[1] ABE M,NAKAMURA S,SHIKANO K,et al.Voice conversion through vector quantization[C]//International Conference on Acoustics,Speech,and Signal Processing(ICASSP-88).New York,USA,1988:655-658.
[2] STYLIANOU Y,CAPPE O,MOULINES E.Continuous probabilistic transform for voice conversion[J].IEEE Transactions on Speech and Audio Processing,1998,6(2):131-142.
[3] TAMAMORI A,HAYASHI T,KOBAYASHI K,et al.Speaker-dependent wavenet vocoder[C]//Proceedings of Interspeech.2017:1118-1122.
[4] LING Z H,DENG L,YU D.Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis[C]//IEEE International Conference on Acoustics.IEEE,2013.
[5] KANEKO T,KAMEOKA H,HIRAMATSU K,et al.Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks[C]//Proceedings of the Interspeech,Stockholm.2017:1283-1287.
[6] HAYASHI T,TAMAMORI A,KOBAYASHI K,et al.An investigation of multi-speaker training for WaveNet vocoder[C]//IEEE Automatic Speech Recognition and Understanding Workshop(ASRU).Okinawa,2017:712-718.
[7] ZHANG X W,MIAO X K,ZENG X,et al.Research status and prospect of speech conversion technology[J].Data Acquisition and Processing,2019,34(5):753-770.
[8] NARENDRANATH M,MURTHY H A,RAJENDRAN S.Transformation of formants for voice conversion using artificial neural networks[J].Speech Communication,1995,16(2):207-216.
[9] KAWAHARA H.Speech representation and transformation using adaptive interpolation of weighted spectrum:vocoder revisited[C]//IEEE International Conference on Acoustics,Speech,and Signal Processing.Munich,1997:1303-1306.
[10] AL-RADHI M S,CSAPÓ T G,NÉMETH G.Continuous vocoder applied in deep neural network based voice conversion[J].Multimedia Tools and Applications,2019,78:33549-33572.
[11] KOBAYASHI K,HAYASHI T,TAMAMORI A,et al.Statistical voice conversion with wavenet-based waveform generation[C]//Proc. Interspeech.2017:1138-1142.
[12] OORD A V D,DIELEMAN S,ZEN H,et al.WaveNet:A generative model for raw audio[EB/OL].(2016-09-12).https://arXiv.org/abs/1609.03499.
[13] NIWA J,YOSHIMURA T,HASHIMOTO K,et al.Statistical voice conversion with WaveNet vocoder[J].arXiv:1907.08940,2020.
[14] HAYASHI T,TAMAMORI A,KOBAYASHI K,et al.An investigation of multi-speaker training for WaveNet vocoder[C]//IEEE Automatic Speech Recognition and Understanding Workshop(ASRU).Okinawa,2017:712-718.
[15] SHEN J,PANG R,WEISS R J,et al.Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[EB/OL].(2017-12-16).https://arxiv.org/abs/1712.05884.
[16] KOBAYASHI K,HAYASHI T,TAMAMORI A,et al.Statistical voice conversion with WaveNet-based waveform generation[C]//Interspeech 2017.Stockholm,Sweden,2017:20-24.
[17] CHEN K,CHEN B,LAI J H,et al.High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder[C]//Interspeech.2018.
[18] SISMAN B,ZHANG M,LI H.Group Sparse Representation with WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2019,27(6):1085-1097.
[19] WU Y C,KOBAYASHI K,HAYASHI T,et al.Collapsed Speech Segment Detection and Suppression for WaveNet Vocoder[C]//Interspeech.2018:1988-1992.
[20] WU Y,TOBING P L,KOBAYASHI K,et al.Non-Parallel Voice Conversion System With WaveNet Vocoder and Collapsed Speech Suppression[J].IEEE Access,2020,8:62094-62106.
[21] HELANDER E,SCHWARZ J,SILEN H,et al.On the impact of alignment on voice conversion performance[C]//9th Annual Conference of the International Speech Communication Association(INTERSPEECH 2008).Brisbane,Australia,2008:22-26.
[22] HSU C C,HWANG H T,WU Y C.Dictionary Update for NMF-based Voice Conversion Using an Encoder-Decoder Network[J].arXiv:1610.03988v1,2016.
[23] SHAH N J,PATIL H A.A novel approach to remove outliers for parallel voice conversion[J].Computer Speech & Language,2019,58:127-152.
[24] KAMEOKA H,TANAKA K,KWAŚNY D,et al.ConvS2S-VC:Fully Convolutional Sequence-to-Sequence Voice Conversion[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2020,28:1849-1863.
[25] MOUCHTARIS A,DER SPIEGEL J V,MUELLER P.Nonparallel training for voice conversion based on a parameter adaptation approach[J].IEEE Trans.Audio,Speech,and Language Processing,2006,14(3):952-963.
[26] DUXANS H,ERRO D,PÉREZ J.Voice Conversion of Non-aligned Data using Unit Selection[C]//TC-STAR Workshop on Speech-to-Speech Translation.Barcelona,Spain,2006:19-21.
[27] LEE C H,WU C H.MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training[C]//ICSLP,Ninth International Conference on Spoken Language Processing(INTERSPEECH 2006).Pittsburgh,PA,USA,2006:17-21.
[28] ERRO D,MORENO A,BONAFONTE A.INCA Algorithm for Training Voice Conversion Systems From Nonparallel Corpora[J].IEEE Transactions on Audio,Speech,and Language Processing,2010,18(5):944-953.
[29] SAITO D,WATANABE S,NAKAMURA A,et al.Statistical voice conversion based on noisy channel model[J].IEEE Trans.Speech and Audio Processing,2012,20(6):1784-1794.
[30] XIE F L,SOONG F K,LI H.A KL divergence and DNN-based approach to voice conversion without parallel training sentences[C]//Interspeech.2016:287-291.
[31] KINNUNEN T,JUVELA L,ALKU P,et al.Non-parallel voice conversion using i-vector PLDA:Towards unifying speaker verification and transformation[C]//Proc.IEEE Int.Conf.Acoust.,Speech,Signal Process.2017:5535-5539.
[32] HSU C C,HWANG H T,WU Y C,et al.Voice conversion from non-parallel corpora using variational auto-encoder[C]//2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference(APSIPA).2016:1-6.
[33] HSU C C,HWANG H T,WU Y C,et al.Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial network[C]//Interspeech.2017.
[34] KANEKO T,KAMEOKA H.CycleGAN-VC:Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks[C]//2018 26th European Signal Processing Conference(EUSIPCO).Rome,2018:2100-2104.
[35] KANEKO T,KAMEOKA H,TANAKA K,et al.CycleGAN-VC2:Improved CycleGAN-based Non-parallel Voice Conversion[C]//ICASSP.2019:6820-6824.
[36] KAMEOKA H,KANEKO T,TANAKA K,et al.StarGAN-VC:non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks[C]//IEEE Spoken Language Technology Workshop(SLT).Athens,Greece,2018:266-273.
[37] KAMEOKA H,KANEKO T,TANAKA K.ACVAE-VC:Non-parallel many-to-many voice conversion with auxiliary classifier variational auto-encoder[J].arXiv:1806.02169,2018.
[38] LU B.Research on speech conversion technology[D].Chengdu:University of Electronic Science and Technology of China,2016.
[39] KAIN A,MACON M W.Spectral voice conversion for text-to-speech synthesis[C]//Proceedings of the 1998 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP '98).1998:285-288.
[40] CHEN Y N,CHU M,CHANG C.Voice conversion with smoothed GMM and MAP adaptation[C]//8th European Conference on Speech Communication and Technology(EUROSPEECH 2003-INTERSPEECH 2003).Geneva,Switzerland,2003:1-4.
[41] TODA T,BLACK A W,TOKUDA K.Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory[J].IEEE Transactions on Audio,Speech,and Language Processing,2007,15(8):2222-2235.
[42] TODA T,OHTANI Y,SHIKANO K.One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices[C]//2007 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP '07).Honolulu,HI,2007:1249-1252.
[43] HELANDER E,VIRTANEN T,NURMINEN J,et al.Voice Conversion Using Partial Least Squares Regression[J].IEEE Transactions on Audio,Speech,and Language Processing,2010,18(5):912-921.
[44] SAITO D,MINEMATSU N,HIROSE K.Tensor Factor Analysis for Arbitrary Speaker Conversion[J].IEICE Transactions on Information and Systems,2020,103(6):1395-1405.
[45] MOHAMMADI S H,KAIN A.Voice conversion using deep neural networks with speaker-independent pre-training[C]//2014 IEEE Spoken Language Technology Workshop(SLT).South Lake Tahoe,NV,2014:19-23.
[46] MING H P,HUANG D Y,XIE L.Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion[C]//Interspeech 2016.2016:2016-1053.
[47] CHEN L H,LIU L J,LING Z H.The USTC System for Voice Conversion Challenge 2016:Neural Network Based Approaches for Spectrum,Aperiodicity and F0 Conversion[C]//INTERSPEECH 2016.San Francisco,USA,2016:8-12.
[48] KANEKO T,KAMEOKA H,HIRAMATSU K,et al.Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks[C]//Interspeech 2017.Stockholm,Sweden,2017:20-24.
[49] KANEKO T,KAMEOKA H,HOJO N.Generative adversarial network-based postfilter for statistical parametric speech synthesis[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2017.
[50] VEKKOT S,GUPTA D,ZAKARIAH M,et al.Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network[J].IEEE Access,2020,8:74627-74647.
[51] STYLIANOU Y.Voice Transformation:A survey[C]//IEEE International Conference on Acoustics,Speech and Signal Processing.Taipei,2009:3585-3588.
[52] MATSUMOTO H,HIKI S,SONE T,et al.Multidimensional Representation of Personal Quality of Vowels and its Acoustical Correlates[J].IEEE Transactions on Audio and Electroacoustics,1973,21(5):428-436.
[53] KOBAYASHI K,TODA T,NAKAMURA S.Implementation of F0 transformation for statistical singing voice conversion based on direct waveform modification[C]//2016 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).Shanghai,2016:5670-5674.
[54] LEE K S.Voice Conversion Using a Perceptual Criterion[J].Applied Sciences,2020,10(8):28-44.
[55] KOMINEK J,BLACK A W.The CMU Arctic speech databases[C]//Proceedings of ISCA Speech Synthesis Workshop.2004.
[56] TODA T,CHEN L H,SAITO D,et al.The Voice Conversion Challenge 2016[C]//Interspeech 2016.2016.
[57] LORENZO-TRUEBA J,YAMAGISHI J,TODA T,et al.The Voice Conversion Challenge 2018:Promoting Development of Parallel and Nonparallel Methods[C]//Odyssey 2018 The Speaker and Language Recognition Workshop.2018.
[58] ZHAO Y,HUANG W C,TIAN X,et al.Voice Conversion Challenge 2020:Intra-lingual semi-parallel and cross-lingual voice conversion[C]//Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge.Shanghai,China,2020.
[59] WU Y C,TOBING P L,HAYASHI T.The NU Non-Parallel Voice Conversion System for the Voice Conversion Challenge[C]//Proc.Odyssey Speaker and Language Recognition Workshop.2018.
[60] CHEN H J,LIANG Q Z,XIE L,et al.Unsupervised acoustic modeling based on DP-GMM:Parallel inference and feasibility study[C]//Proceedings of National Conference on Man-Machine Speech Communication(NCMMSC'2015).2015:69-70.
[61] ZHOU Y,TIAN X,LI H.Multi-Task WaveRNN With an Integrated Architecture for Cross-Lingual Voice Conversion[J].IEEE Signal Processing Letters,2020,27:1310-1314.