Computer Science ›› 2022, Vol. 49 ›› Issue (6A): 331-336. doi: 10.11896/jsjkx.210500180

• Image Processing & Multimedia Technology •

Analysis and Trend Research of End-to-End Framework Model of Intelligent Speech Technology

LI Sun, CAO Feng

  1. China Academy of Information and Communications Technology, Beijing 100191, China
  • Online: 2022-06-10  Published: 2022-06-08
  • Corresponding author: CAO Feng (caofeng@caict.ac.cn)
  • Author email: LI Sun (lisun@caict.ac.cn)
  • About author: LI Sun, born in 1988, master, senior engineer. Her main research interests include artificial intelligence policy, standards, and industry research, covering machine learning, perceptual cognitive technology, and intelligent customer service.
    CAO Feng, born in 1986, master, engineer. His main research interests include artificial intelligence evaluation standards, artificial intelligence engineering, robotic process automation, and intelligent speech semantics.


Abstract: The end-to-end framework is a probability model, based on deep neural networks, that directly predicts target-language characters from the speech signal. From raw data input to final output, the intermediate processing and the neural network are integrated, which frees the model from human subjective bias, extracts features directly, fully mines the information in the data, and simplifies the processing pipeline. In recent years, the introduction of the attention mechanism has helped end-to-end architectures realize mutual mapping between modalities, further improving overall performance. A survey of recent end-to-end techniques and applications in the field of intelligent speech shows that the end-to-end architecture provides new ideas and methods for speech model algorithms, but problems remain: hybrid frameworks cannot effectively balance the characteristics of the individual techniques they combine, and the complexity of the model's internal logic makes manual debugging difficult and weakens customization and scalability. End-to-end integrated models will develop further in speech applications along two lines. One is module-level end-to-end modeling from front end to back end, which drops the multiple-input assumptions involved in front-end speech enhancement and back-end speech recognition and integrates speech enhancement with acoustic modeling. The other is end-to-end modeling of the interactive information carrier, which focuses on extracting and processing information from the speech signal itself, so that human-computer interaction comes closer to real human language communication.
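The attention step that the abstract credits with mapping between modalities can be sketched compactly. The following is an illustrative NumPy sketch (not from the paper): one decoder step of dot-product attention over encoder states, as used in attention-based end-to-end speech models; all names and dimensions here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(query, enc_states):
    """One decoder step of dot-product attention.

    query:      (d,)  current decoder state
    enc_states: (T, d) encoder outputs for T speech frames
    Returns (context, weights): the weighted sum over frames and
    the attention distribution that produced it.
    """
    scores = enc_states @ query        # (T,) alignment scores
    weights = softmax(scores)          # (T,) distribution over frames
    context = weights @ enc_states     # (d,) context vector
    return context, weights

# Toy demo: 50 encoded frames of dimension 16.
rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 16))
q = rng.standard_normal(16)
ctx, w = attention_step(q, enc)
assert ctx.shape == (16,) and abs(w.sum() - 1.0) < 1e-9
```

In a full model such as Listen, Attend and Spell, the context vector would be concatenated with the decoder state to predict the next output character, and the loop would repeat until an end-of-sequence symbol is emitted.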

Key words: End-to-end model, Human-computer interaction, Hybrid framework, Intelligent speech

CLC Number: TN912.34