Computer Science ›› 2022, Vol. 49 ›› Issue (6A): 331-336. doi: 10.11896/jsjkx.210500180

• Image Processing & Multimedia Technology •

Analysis and Trend Research of End-to-End Framework Model of Intelligent Speech Technology

LI Sun, CAO Feng   

  1. China Academy of Information and Communications Technology, Beijing 100191, China
  • Online: 2022-06-10  Published: 2022-06-08
  • About author: LI Sun, born in 1988, master, senior engineer. Her main research interests include artificial intelligence policy, standards, and industry research, covering machine learning, perceptual cognitive technology, and intelligent customer service.
    CAO Feng, born in 1986, master, engineer. His main research interests include artificial intelligence evaluation standards, artificial intelligence engineering, robotic process automation, and intelligent speech semantics.

Abstract: The end-to-end framework is a probabilistic model based on deep neural networks that maps the speech signal directly to characters of the target language. From raw input to final output, the intermediate processing stages are folded into a single network, which frees the model from human subjective bias, extracts features directly, fully mines the information in the data, and simplifies the processing pipeline. In recent years, the introduction of the attention mechanism has enabled end-to-end architectures to learn mappings between modalities, further improving overall performance. A review of recent end-to-end techniques and applications in intelligent speech shows that the end-to-end architecture offers new ideas and methods for speech modeling, but problems remain: hybrid frameworks cannot effectively balance the characteristics of their individual component techniques, the complexity of the model's internal logic makes human intervention and debugging difficult, and customizability and scalability are weakened. End-to-end integrated models in the speech field can be expected to develop further along two lines. On the one hand, joint front-end/back-end modeling drops the separate input assumptions of front-end speech enhancement and back-end speech recognition, integrating enhancement with acoustic modeling. On the other hand, with the speech signal itself as the information carrier, end-to-end interaction focuses on extracting and processing information from the raw signal, bringing human-computer interaction closer to real human language communication.

Key words: End-to-end model, Human-computer interaction, Hybrid framework, Intelligent speech

CLC Number: TN912.34