Computer Science ›› 2022, Vol. 49 ›› Issue (4): 221-226. doi: 10.11896/jsjkx.210300071

• Computer Graphics & Multimedia •

End-to-End Speech Synthesis Based on BERT

AN Xin, DAI Zi-biao, LI Yang, SUN Xiao, REN Fu-ji   

  1. School of Computer and Information, Hefei University of Technology, Hefei 230601, China; Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, Hefei University of Technology, Hefei 230601, China
  • Received:2021-03-08 Revised:2021-06-04 Published:2022-04-01
  • Corresponding author: DAI Zi-biao (1224269321@qq.com)
  • About author: AN Xin, born in 1987, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include embedded systems and machine learning (xin.an@hfut.edu.cn). DAI Zi-biao, born in 1994, postgraduate. His main research interests include machine learning and affective computing.
  • Supported by:
    This work was supported by the Joint Funds of the National Natural Science Foundation of China (U1613217), the Key Research and Development Projects of Anhui Province of China (202004d07020004) and the Fundamental Research Funds for the Central Universities of Ministry of Education of China (JZ2020YYPY0092).

Abstract: To address the low training and prediction efficiency and the long-distance information loss of RNN-based neural speech synthesis models, an end-to-end BERT-based speech synthesis method is proposed, in which a self-attention mechanism replaces the RNN as the encoder in the Seq2Seq speech synthesis architecture. The method uses a pre-trained BERT as the model's encoder to extract contextual information from the input text, the decoder adopts the same architecture as the speech synthesis model Tacotron2 to output the Mel spectrum, and finally a trained WaveGlow network converts the Mel spectrum into the final audio. By fine-tuning the pre-trained BERT for the downstream task, the method greatly reduces the number of trainable parameters and the training time. Meanwhile, the self-attention mechanism allows the hidden states in the encoder to be computed in parallel, which makes full use of the parallel computing power of the GPU to improve training efficiency and effectively alleviates the long-range dependency problem. Comparison experiments with the Tacotron2 model show that the proposed model achieves results close to those of Tacotron2 while roughly doubling the training speed.
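
As a rough illustration of the pipeline described above, the following PyTorch sketch shows the idea: a pre-trained BERT replaces the RNN encoder, a simplified Tacotron2-style attention/LSTM decoder predicts mel-spectrogram frames, and a separately trained WaveGlow vocoder turns the Mel spectrum into audio. This is a minimal sketch under stated assumptions, not the authors' released code; the MelDecoder class, the 80-band mel setting and the bert-base-chinese checkpoint are illustrative choices.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class MelDecoder(nn.Module):
        """Toy Tacotron2-style decoder: attends over the BERT hidden states and
        autoregressively emits one mel-spectrogram frame per step."""
        def __init__(self, enc_dim=768, n_mels=80):
            super().__init__()
            self.attn = nn.MultiheadAttention(enc_dim, num_heads=8, batch_first=True)
            self.prenet = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU())
            self.lstm = nn.LSTMCell(256 + enc_dim, enc_dim)
            self.mel_proj = nn.Linear(enc_dim, n_mels)
            self.n_mels, self.enc_dim = n_mels, enc_dim

        def forward(self, enc_out, n_frames):
            batch = enc_out.size(0)
            frame = enc_out.new_zeros(batch, self.n_mels)        # all-zero "GO" frame
            h = enc_out.new_zeros(batch, self.enc_dim)
            c = enc_out.new_zeros(batch, self.enc_dim)
            mels = []
            for _ in range(n_frames):
                # attention context over the encoder states (computed once, in parallel)
                ctx, _ = self.attn(h.unsqueeze(1), enc_out, enc_out)
                x = torch.cat([self.prenet(frame), ctx.squeeze(1)], dim=-1)
                h, c = self.lstm(x, (h, c))
                frame = self.mel_proj(h)
                mels.append(frame)
            return torch.stack(mels, dim=1)                      # (batch, n_frames, n_mels)

    # Pre-trained Chinese BERT as the encoder; in the paper it is fine-tuned together
    # with the decoder on the TTS task, while WaveGlow is trained separately as the vocoder.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    encoder = BertModel.from_pretrained("bert-base-chinese")
    decoder = MelDecoder()

    inputs = tokenizer("基于BERT的端到端语音合成", return_tensors="pt")
    enc_out = encoder(**inputs).last_hidden_state                # (1, seq_len, 768)
    mel = decoder(enc_out, n_frames=200)                         # predicted Mel spectrum
    # waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow")
    # audio = waveglow.infer(mel.transpose(1, 2))                # Mel spectrum -> waveform

The encoder pass over the whole input sequence is a single parallel computation, which is where the training-speed gain over an RNN encoder comes from; the autoregressive mel decoding remains sequential, as in Tacotron2.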

Key words: Attention mechanism, Recurrent neural network (RNN), Seq2Seq, Speech synthesis, WaveGlow

CLC Number: TP391