Computer Science, 2022, Vol. 49, Issue 4: 221-226. doi: 10.11896/jsjkx.210300071
AN Xin, DAI Zi-biao, LI Yang, SUN Xiao, REN Fu-ji
Abstract: To address the low training and inference efficiency of RNN-based neural speech synthesis models and their loss of long-range information, this paper proposes a BERT-based end-to-end speech synthesis method that replaces the RNN encoder in the Seq2Seq speech synthesis architecture with a self-attention mechanism. The method uses a pre-trained BERT model as the encoder to extract contextual information from the input text; the decoder adopts the same architecture as the Tacotron2 speech synthesis model and outputs mel spectrograms, which a trained WaveGlow network then converts into the final audio. Building on pre-trained BERT, the method adapts to the downstream task through fine-tuning, greatly reducing the number of trainable parameters and the training time. Moreover, the self-attention mechanism allows the hidden states in the encoder to be computed in parallel, fully exploiting the parallel computing power of GPUs to improve training efficiency, and it effectively alleviates the long-range dependency problem. Comparative experiments against Tacotron2 show that the proposed model achieves results close to those of Tacotron2 while roughly doubling the training speed.
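The parallelism claim above rests on a property of scaled dot-product self-attention: the hidden state for every input position is produced by a fixed set of matrix products, so all positions are computed at once, whereas an RNN must step through the sequence one token at a time. The following minimal NumPy sketch (not the authors' code; the toy sizes, weight matrices, and function names are illustrative assumptions) shows a single attention head computing one hidden state per token in parallel:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    All T positions are handled by batched matrix products,
    unlike an RNN, which must iterate over positions sequentially."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) attention weights
    return A @ V                         # (T, d_v) context vectors

rng = np.random.default_rng(0)
T, d_model = 6, 16  # toy sequence length and hidden size
X = rng.normal(size=(T, d_model))            # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)
print(H.shape)  # (6, 16): one hidden state per token, computed in parallel
```

In the paper's architecture this computation happens inside the pre-trained BERT encoder (stacked multi-head attention layers); the sketch isolates only the mechanism that removes the sequential bottleneck of an RNN encoder.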