Computer Science ›› 2022, Vol. 49 ›› Issue (1): 47-52. doi: 10.11896/jsjkx.210900013

• Multilingual Computing Advanced Technology •

Latest Development of Multilingual Speech Recognition Acoustic Model Modeling Methods

CHENG Gao-feng¹, YAN Yong-hong¹,²

  1 Institute of Acoustics,Chinese Academy of Sciences,Beijing 100190,China
    2 School of Electronic,Electrical and Communication Engineering,University of Chinese Academy of Sciences,Beijing 100049,China
  • Received:2021-09-01 Revised:2021-10-18 Online:2022-01-15 Published:2022-01-18
  • About author:CHENG Gao-feng,born in 1990,Ph.D,assistant professor.His main research interests include speech recognition and deep learning.
    YAN Yong-hong,born in 1967,Ph.D,professor.His main research interests include speech processing and recognition,language/speaker recognition,and human-computer interface.

Abstract: With the rapid development of multimedia and communication technology,the amount of multilingual speech data on the Internet keeps growing,and speech recognition is the core technology for analyzing and processing such media.How to expand quickly from a few major languages,such as Chinese and English,to many more languages has become a prominent obstacle to improving multilingual processing capability.This article surveys the latest progress in acoustic modeling and discusses the breakthroughs that traditional speech recognition technology needs on the way from a single language to multiple languages.It also exploits the latest end-to-end speech recognition technology to build a keyword spotting system that achieves favorable performance.The approach is detailed as follows:1)hierarchical and structured multilingual acoustic modeling;2)multilingual acoustic modeling based on language classification information;3)end-to-end keyword spotting based on frame-synchronous alignments.
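To make approaches 1) and 2) concrete, below is a minimal PyTorch sketch of a shared multilingual encoder with language-specific output heads (hierarchical, structured parameter sharing), conditioned on a language-ID embedding (language classification information). It is an illustration under assumed module names, dimensions, and a four-language inventory, not the authors' published implementation.

import torch
import torch.nn as nn


class MultilingualAcousticModel(nn.Module):
    """Shared encoder + per-language output heads, conditioned on a
    language-ID embedding. All sizes below are illustrative assumptions."""
    def __init__(self, feat_dim=80, hidden_dim=512, lang_emb_dim=32,
                 senones_per_lang=(3000, 3000, 3000, 3000)):
        super().__init__()
        # Language classification information enters as an embedding
        # appended to every acoustic frame (approach 2).
        self.lang_emb = nn.Embedding(len(senones_per_lang), lang_emb_dim)
        # Shared trunk: a language-independent acoustic encoder.
        self.encoder = nn.LSTM(feat_dim + lang_emb_dim, hidden_dim,
                               num_layers=3, batch_first=True)
        # Structured sharing (approach 1): one senone classifier per
        # language on top of the shared trunk.
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n)
                                   for n in senones_per_lang)

    def forward(self, feats, lang_id):
        # feats: (batch, time, feat_dim); lang_id: (batch,) language indices.
        emb = self.lang_emb(lang_id).unsqueeze(1).expand(-1, feats.size(1), -1)
        enc, _ = self.encoder(torch.cat([feats, emb], dim=-1))
        # Route each utterance through the head of its own language.
        return [self.heads[l](enc[i]) for i, l in enumerate(lang_id.tolist())]


# Usage: two utterances, 200 frames of 80-dim features, languages 0 and 2.
model = MultilingualAcousticModel()
logits = model(torch.randn(2, 200, 80), torch.tensor([0, 2]))
print([x.shape for x in logits])   # two (200, 3000) senone score matrices

A production system would batch the per-language routing with masking rather than a Python loop; the point here is only the shared-trunk, per-language-head structure.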
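For approach 3), the sketch below scores a keyword against frame-synchronous CTC posteriors using a greedy collapse of per-frame labels. This is a simplified stand-in for the frame-synchronous alignment search summarized above, and every function and variable name in it is hypothetical.

import torch


def ctc_greedy_keyword_score(log_probs, keyword_ids, blank=0):
    """log_probs: (time, vocab) frame-synchronous CTC log-posteriors.
    Returns (hit, score): whether the greedily decoded label sequence
    contains the keyword, and the hit's average frame log-posterior."""
    best = log_probs.argmax(dim=-1).tolist()   # best label per frame
    # Standard CTC collapse: merge repeated labels, then drop blanks,
    # remembering the log-posterior at each surviving frame.
    collapsed, prev = [], None
    for t, lab in enumerate(best):
        if lab != prev and lab != blank:
            collapsed.append((lab, log_probs[t, lab].item()))
        prev = lab
    labels = [lab for lab, _ in collapsed]
    k = len(keyword_ids)
    for i in range(len(labels) - k + 1):
        if labels[i:i + k] == list(keyword_ids):
            score = sum(s for _, s in collapsed[i:i + k]) / k
            return True, score
    return False, float("-inf")


# Usage with random posteriors over a 10-symbol vocabulary.
log_probs = torch.randn(100, 10).log_softmax(dim=-1)
print(ctc_greedy_keyword_score(log_probs, keyword_ids=[3, 5, 2]))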

Key words: Acoustic model, Multilingual, Speech recognition

CLC Number: TP391
[1]HINTON G,DENG L,YU D,et al.Deep neural networks for acoustic modeling in speech recognition:the shared views of four research groups[J].IEEE Signal Processing Magazine,2012,29(6):82-97.
[2]POVEY D,PEDDINTI V,GALVEZ D,et al.Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]//Interspeech.2016:2751-2755.
[3]GRAVES A,FERNÁNDEZ S,GOMEZ F,et al.Connectionist temporal classification:labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning.2006:369-376.
[4]LIU C,ZHANG Q,ZHANG X,et al.Multilingual graphemic hybrid ASR with massive data augmentation[J].arXiv:1909.06522,2019.
[5]TONG S,GARNER P N,BOURLARD H.An investigation of multilingual ASR using end-to-end LF-MMI[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP 2019).IEEE,2019:6061-6065.
[6]TONG S,GARNER P N,BOURLARD H.Cross-lingual adaptation of a CTC-based multilingual acoustic model[J].Speech Communication,2018,104:39-46.
[7]TONG S,GARNER P N,BOURLARD H.Fast Language Adaptation Using Phonological Information[C]//INTERSPEECH.2018:2459-2463.
[8]HSU J Y,CHEN Y J,LEE H.Meta learning for end-to-end low-resource speech recognition[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2020:7844-7848.
[9]DALMIA S,SANABRIA R,METZE F,et al.Sequence-based multi-lingual low resource speech recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4909-4913.
[10]CHEN Y C,HSU J Y,LEE C K,et al.DARTS-ASR:Differentiable architecture search for multilingual speech recognition and adaptation[J].arXiv:2005.07029,2020.
[11]THOMAS S,AUDHKHASI K,KINGSBURY B.Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings[C]//INTERSPEECH.2020:4736-4740.
[12]GRAVES A.Sequence transduction with recurrent neural networks[J].arXiv:1211.3711,2012.
[13]CHAN W,JAITLY N,LE Q,et al.Listen,attend and spell:A neural network for large vocabulary conversational speech recognition[C]//2016 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2016:4960-4964.
[14]PRATAP V,SRIRAM A,TOMASELLO P,et al.Massively multilingual ASR:50 languages,1 model,1 billion parameters[J].arXiv:2007.03001,2020.
[15]LI B,PANG R,SAINATH T N,et al.Scaling end-to-end models for large-scale multilingual ASR[J].arXiv:2104.14830,2021.
[16]DATTA A,RAMABHADRAN B,EMOND J,et al.Language agnostic multilingual modeling[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2020:8239-8243.
[17]KARAFIÁT M,BASKAR M K,WATANABE S,et al.Analysis of multilingual sequence-to-sequence speech recognition systems[J].arXiv:1811.03451,2018.
[18]ADAMS O,WIESNER M,WATANABE S,et al.Massively multilingual adversarial speech recognition[J].arXiv:1904.02210,2019.
[19]CHO J,BASKAR M K,LI R,et al.Multilingual sequence-to-sequence speech recognition:architecture,transfer learning,and language modeling[C]//2018 IEEE Spoken Language Technology Workshop (SLT).IEEE,2018:521-527.
[20]ZHOU S,XU S,XU B.Multilingual end-to-end speech recognition with a single transformer on low-resource languages[J].arXiv:1806.05059,2018.
[21]LI B,ZHANG Y,SAINATH T,et al.Bytes are all you need:end-to-end multilingual speech recognition and synthesis with bytes[C]//ICASSP 2019-2019 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2019:5621-5625.
[22]HOU W,DONG Y,ZHUANG B,et al.Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning[C]//INTERSPEECH.2020:1037-1041.
[23]WATANABE S,HORI T,HERSHEY J R.Language independent end-to-end architecture for joint language identification and speech recognition[C]//2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2017:265-271.
[24]POVEY D,GHOSHAL A,BOULIANNE G,et al.The Kaldi speech recognition toolkit[C]//IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.IEEE Signal Processing Society,2011.
[25]CAI W,CAI Z,ZHANG X,et al.A novel learnable dictionary encoding layer for end-to-end language identification[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5189-5193.
[26]CAI W,CAI Z,LIU W,et al.Insights into end-to-end learning scheme for language identification[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5209-5213.
[27]MIAO X,MCLOUGHLIN I.LSTM-TDNN with convolutional front-end for dialect identification in the 2019 multi-genre broadcast challenge[J].arXiv:1912.09003,2019.
[28]MIAO X,MCLOUGHLIN I,YAN Y.A New Time-Frequency Attention Tensor Network for Language Identification[J].Circuits,Systems,and Signal Processing,2020,39(5):2744-2758.
[29]BEDYAKIN R,MIKHAYLOVSKIY N.Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions[J].arXiv:2106.00052,2021.
[30]TJANDRA A,CHOUDHURY D G,ZHANG F,et al.Improved language identification through cross-lingual self-supervised learning[J].arXiv:2107.04082,2021.
[31]KANNAN A,DATTA A,SAINATH T N,et al.Large-scale multilingual speech recognition with a streaming end-to-end model[C]//Proc.Interspeech 2019.2019:2130-2134.
[32]TOSHNIWAL S,SAINATH T N,WEISS R J,et al.Multilingual speech recognition with a single end-to-end model[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4904-4908.
[33]PUNJABI S,ARSIKERE H,RAEESY Z,et al.Streaming end-to-end bilingual ASR systems with joint language identification[J].arXiv:2007.03900,2020.
[34]MÜLLER M,STÜKER S,WAIBEL A.Multilingual adaptation of RNN based ASR systems[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5219-5223.
[35]SEKI H,WATANABE S,HORI T,et al.An end-to-end language-tracking speech recognizer for mixed-language speech[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4919-4923.
[36]WATERS A,GAUR N,HAGHANI P,et al.Leveraging language ID in multilingual end-to-end speech recognition[C]//2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2019:928-935.
[37]PUNJABI S,ARSIKERE H,RAEESY Z,et al.Joint ASR and language identification using RNN-T:An efficient approach to dynamic language switching[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2021:7218-7222.
[38]LIU D,WAN X,XU J,et al.Multilingual Speech Recognition Training and Adaptation with Language-Specific Gate Units[C]//2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP).IEEE,2018:86-90.
[39]LIU D,XU J,ZHANG P,et al.A unified system for multilingual speech recognition and language identification[J].Speech Communication,2021,127:17-28.
[40]LIU D,XU J,ZHANG P.End-to-End Multilingual Speech Recognition System with Language Supervision Training[J].IEICE Transactions on Information and Systems,2020,103(6):1427-1430.
[41]KIM S,SELTZER M L.Towards language-universal end-to-end speech recognition[C]//Proc.of the IEEE International Conference on Acoustics,Speech and Signal Processing.2018:4914-4918.
[42]YI J,TAO J,WEN Z,et al.Adversarial multilingual training for low-resource speech recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4899-4903.
[43]YI J,TAO J,WEN Z,et al.Language-adversarial transfer learning for low-resource speech recognition[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2018,27(3):621-630.
[44]STOLCKE A.SRILM - an extensible language modeling toolkit[C]//Proc.of the International Conference on Spoken Language Processing.2002:901-904.
[45]WELLS J.SAMPA computer readable phonetic alphabet[M]//Handbook of Standards and Resources for Spoken Language Systems.Berlin and New York:Mouton de Gruyter,1997.
[46]HAMPSHIRE J B,WAIBEL A H.A novel objective function for improved phoneme recognition using time delay neural networks[C]//Proc.of the International 1989 Joint Conference on Neural Networks.1989:235-241.
[47]WAIBEL A,HANAZAWA T,HINTON G,et al.Phoneme recognition using time-delay neural networks[J].IEEE Transactions on Acoustics,Speech,and Signal Processing,1989,37(3):328-339.
[48]HAMPSHIRE J B,WAIBEL A H.A novel objective function for improved phoneme recognition using time-delay neural networks[J].IEEE Transactions on Neural Networks,1990,1(2):216-228.
[49]CHOROWSKI J,BAHDANAU D,SERDYUK D,et al.Attention-based models for speech recognition[C]//Advances in Neural Information Processing Systems 28:Annual Conference on Neural Information Processing Systems 2015.2015:577-585.
[50]LI J,YE G,DAS A,et al.Advancing acoustic-to-word CTC model[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5794-5798.
[51]YUAN Y,LEUNG C C,XIE L,et al.Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection[C]//2017 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2017:5645-5649.
[52]RAM D,MICULICICH L,BOURLARD H.Multilingual bottleneck features for query by example spoken term detection[C]//2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2019:621-628.
[53]RAM D,MICULICICH L,BOURLARD H.Neural network based end-to-end query by example spoken term detection[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2020,28:1416-1427.
[54]WATANABE S,HORI T,KIM S,et al.Hybrid CTC/attention architecture for end-to-end speech recognition[J].IEEE Journal of Selected Topics in Signal Processing,2017,11(8):1240-1253.
[55]WATANABE S,HORI T,KARITA S,et al.ESPnet:end-to-end speech processing toolkit[C]//Interspeech.2018:2207-2211.
[56]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.
[57]GAGE P.A new algorithm for data compression[J].C Users Journal,1994,12(2):23-38.
[58]SENNRICH R,HADDOW B,BIRCH A.Neural machine translation of rare words with subword units[J].arXiv:1508.07909,2015.
Related Articles
[1] XU Ming-ke, ZHANG Fan. Head Fusion:A Method to Improve Accuracy and Robustness of Speech Emotion Recognition [J]. Computer Science, 2022, 49(7): 132-141.
[2] LIU Jun-peng, SU Jin-song, HUANG De-gen. Incorporating Language-specific Adapter into Multilingual Neural Machine Translation [J]. Computer Science, 2022, 49(1): 17-23.
[3] YU Dong, XIE Wan-ying, GU Shu-hao, FENG Yang. Similarity-based Curriculum Learning for Multilingual Neural Machine Translation [J]. Computer Science, 2022, 49(1): 24-30.
[4] YANG Run-yan, CHENG Gao-feng, LIU Jian. Study on Keyword Search Framework Based on End-to-End Automatic Speech Recognition [J]. Computer Science, 2022, 49(1): 53-58.
[5] LIU Chuang, XIONG De-yi. Survey of Multilingual Question Answering [J]. Computer Science, 2022, 49(1): 65-72.
[6] ZHENG Chun-jun, WANG Chun-li, JIA Ning. Survey of Acoustic Feature Extraction in Speech Tasks [J]. Computer Science, 2020, 47(5): 110-119.
[7] CUI Yang, LIU Chang-hong. PIFA-based Evaluation Platform for Speech Recognition System [J]. Computer Science, 2020, 47(11A): 638-641.
[8] ZHANG Jing, YANG Jian, SU Peng. Survey of Monosyllable Recognition in Speech Recognition [J]. Computer Science, 2020, 47(11A): 172-174.
[9] SHI Yan-yan, BAI Jing. Speech Recognition Combining CFCC and Teager Energy Operators Cepstral Coefficients [J]. Computer Science, 2019, 46(5): 286-289.
[10] LONG Xing-yan, QU Dan, ZHANG Wen-lin. Attention Based Acoustic Model Combining Bottleneck Feature [J]. Computer Science, 2019, 46(1): 260-264.
[11] ZHANG Ai-ying. Research on Low-resource Mongolian Speech Recognition Based on Multilingual Speech Data Selection [J]. Computer Science, 2018, 45(9): 308-313.
[12] ZHANG Ai-ying and NI Chong-jia. Research on Low-resource Mongolian Speech Recognition [J]. Computer Science, 2017, 44(10): 318-322.
[13] LI Wei-lin, WEN Jian and MA Wen-kai. Speech Recognition System Based on Deep Neural Network [J]. Computer Science, 2016, 43(Z11): 45-49.
[14] WEI Ying, WANG Shuang-wei, PAN Di, ZHANG Ling, XU Ting-fa and LIANG Shi-li. Specific Two Words Chinese Lexical Recognition Based on Broadband and Narrowband Spectrogram Feature Fusion with Zoning Projection [J]. Computer Science, 2016, 43(Z11): 215-219.
[15] SUN Zhi-yuan, LU Cheng-xiang, SHI Zhong-zhi and MA Gang. Research and Advances on Deep Learning [J]. Computer Science, 2016, 43(2): 1-8.