Computer Science, 2022, Vol. 49, Issue (4): 221-226. doi: 10.11896/jsjkx.210300071

• Computer Graphics & Multimedia •

End-to-End Speech Synthesis Based on BERT

AN Xin, DAI Zi-biao, LI Yang, SUN Xiao, REN Fu-ji   

  1. School of Computer, Information, Hefei University of Technology, Hefei 230601, China; Anhui Province Key Laboratory of Affective Computing, Advanced Intelligent Machine, Hefei University of Technology, Hefei 230601, China
  • Received: 2021-03-08 Revised: 2021-06-04 Published: 2022-04-01
  • About author: AN Xin, born in 1987, Ph.D., associate professor, is a member of the China Computer Federation. His main research interests include embedded systems and machine learning. DAI Zi-biao, born in 1994, postgraduate. His main research interests include machine learning and affective computing.
  • Supported by:
    This work was supported by the Joint Funds of the National Natural Science Foundation of China (U1613217), Key Research and Development Projects of Anhui Province of China (202004d07020004) and Fundamental Research Funds for the Central Universities of Ministry of Education of China (JZ2020YYPY0092).

Abstract: RNN-based neural speech synthesis models are slow to train and to run at prediction time, and they lose information over long distances. To address these problems, this paper proposes an end-to-end BERT-based speech synthesis method that replaces the RNN encoder in the Seq2Seq speech synthesis architecture with a self-attention mechanism. A pre-trained BERT serves as the encoder and extracts contextual information from the input text; the decoder, which uses the same architecture as the Tacotron2 speech synthesis model, outputs the Mel spectrogram; and a trained WaveGlow network then converts the Mel spectrogram into the final audio. Because the downstream task is fine-tuned from a pre-trained BERT, the method significantly reduces the number of parameters to be trained and the training time. In addition, the self-attention mechanism allows the hidden states of the encoder to be computed in parallel, which makes full use of the parallel computing power of the GPU, improves training efficiency, and effectively alleviates the long-range dependency problem. Comparison experiments with Tacotron2 show that the proposed model doubles the training speed while achieving results similar to those of Tacotron2.
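The parallelism argument in the abstract can be illustrated with a small sketch. The code below is not the paper's model: it is a toy, single-head self-attention with identity projections (Q = K = V = input), set beside a toy recurrence, and all function names, the `decay` parameter, and the example vectors are assumptions made for illustration. It is only meant to show that each attention output depends on the inputs alone, so all positions could be computed at once, whereas each recurrent state needs the previous one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Toy single-head scaled dot-product self-attention.

    Assumption: identity projections (Q = K = V = x). A real BERT encoder
    uses learned projections, multiple heads, and position embeddings.
    """
    d = len(x[0])
    out = []
    for q in x:  # iterations are independent, so they could run in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def rnn_states(x, decay=0.5):
    """Toy recurrence h_t = tanh(decay * h_{t-1} + x_t).

    Step t cannot start before step t-1 has finished.
    """
    h = [0.0] * len(x[0])
    states = []
    for x_t in x:
        h = [math.tanh(decay * h_j + x_j) for h_j, x_j in zip(h, x_t)]
        states.append(h)
    return states


# Hypothetical 2-dimensional embeddings for a 3-token input.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
recurrent = rnn_states(tokens)
```

In this sketch every attention output attends directly to every token, which is one way to read the abstract's claim about alleviating long-range dependencies; in the recurrence, information from early tokens reaches later states only through the chain of intermediate states.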

Key words: Attention mechanism, Recurrent neural network(RNN), Seq2Seq, Speech synthesis, WaveGlow

CLC Number: 

  • TP391