Computer Science ›› 2022, Vol. 49 ›› Issue (6A): 301-308. doi: 10.11896/jsjkx.210300134

• Image Processing & Multimedia Technology •

Research Progress on Speech Style Transfer

LIU Chang, WEI Wei-min, MENG Fan-xing, CAI Zhi   

  1. School of Computer Science and Technology, Shanghai University of Electric Power, Shanghai 200000, China
  • Online: 2022-06-10  Published: 2022-06-08
  • About author: LIU Chang, born in 1997, postgraduate. Her main research interests include speech spoofing detection.
    WEI Wei-min, born in 1970, Ph.D., assistant professor. His main research interests include information security, image processing, digital forensics, information hiding and machine learning.
  • Supported by:
    Natural Science Foundation of Shanghai, China (16ZR1413100).

Abstract: Speech style transfer technology converts the timbre or speaking style of a source speaker into that of a target speaker without changing the linguistic content of the speech. Driven by the urgent need for privacy protection on social media and the rapid development of neural-network-based audio tampering techniques, speech style transfer has been studied in depth. After introducing the basic principles of speech style transfer, this paper analyzes the state of research from the perspective of three key components: the vocoder (traditional vocoders and the WaveNet vocoder), corpus alignment (parallel and non-parallel corpora), and the transfer model (conventional statistical models and neural network models). It then summarizes the open problems and challenges of speech style transfer technology and discusses directions for future development.
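
To make the surveyed pipeline concrete, the following is a minimal Python sketch (assuming the librosa library is available) of the analysis-conversion-synthesis flow outlined above. The ConversionModel class and the Griffin-Lim inversion are illustrative stand-ins for the transfer models and neural vocoders this survey reviews, not any particular system's implementation; the input file name in the usage note is hypothetical.

import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mels=80):
    # Analysis stage: compute a log-mel spectrogram from the source utterance.
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    return np.log(mel + 1e-6)

class ConversionModel:
    # Placeholder transfer model mapping source features toward the target
    # speaker. A real system would use a trained GMM, DNN, or CycleGAN-VC here.
    def convert(self, log_mel):
        return log_mel  # identity mapping, for illustration only

def synthesize(log_mel, sr=16000):
    # Synthesis stage: invert the converted features back to a waveform.
    # Griffin-Lim serves as a simple stand-in for a neural vocoder such as WaveNet.
    mel = np.exp(log_mel) - 1e-6
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                hop_length=256)

# Usage (hypothetical input file):
#   wav = synthesize(ConversionModel().convert(extract_features("source.wav")))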

Key words: Corpus alignment, Neural network, Speech, Style transfer, Vocoder

CLC Number: TP391