Computer Science ›› 2022, Vol. 49 ›› Issue (1): 47-52. doi: 10.11896/jsjkx.210900013

• Frontier Technologies in Multilingual Computing*

Latest Development of Multilingual Speech Recognition Acoustic Model Modeling Methods

CHENG Gao-feng1, YAN Yong-hong1,2

  1 Institute of Acoustics,Chinese Academy of Sciences,Beijing 100190,China
    2 School of Electronic,Electrical and Communication Engineering,University of Chinese Academy of Sciences,Beijing 100049,China
  • Received:2021-09-01 Revised:2021-10-18 Online:2022-01-15 Published:2022-01-18
  • Corresponding author: YAN Yong-hong (yanyonghong@hccl.ioa.ac.cn)
  • Author email: chenggaofeng@hccl.ioa.ac.cn

  • About author:CHENG Gao-feng,born in 1990,Ph.D,assistant professor.His main research interests include speech recognition and deep learning.
    YAN Yong-hong, born in 1967,Ph.D,professor.His main research interests include speech processing and recognition,language/speaker recognition,and human computer interface.

Abstract: With the rapid development of multimedia information and communication technology, multilingual speech data on the Internet is growing steadily. Speech recognition is the core technology for speech analysis and processing; how to quickly extend the processing capabilities built for a few high-resource major languages, such as Chinese and English, to many more low-resource languages is a bottleneck that current recognition technology urgently needs to break through. This paper summarizes the latest progress in acoustic model modeling and discusses the difficulties that traditional speech recognition technology may face in moving from a single language to many languages. On this basis, it explores the role of the latest end-to-end speech recognition technology in building keyword search systems, so as to further improve overall system performance. Finally, it summarizes the following recent advances: 1) multilingual acoustic modeling based on model parameter sharing; 2) multilingual acoustic modeling based on language classification information; 3) end-to-end keyword search based on frame-level alignments.
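Point 1) above, multilingual acoustic modeling with shared model parameters, is commonly realized as one encoder trained on pooled data from all languages, with a small language-specific output layer per language. The sketch below illustrates that layout only; the layer sizes, language inventory, and random features are hypothetical, not the architecture surveyed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared encoder: a single weight matrix whose parameters are
# updated by training data from every language.
feat_dim, hid_dim = 40, 64
W_shared = rng.standard_normal((feat_dim, hid_dim)) * 0.1

# Language-specific output heads: each language keeps its own
# phone inventory (sizes here are made up for illustration).
phone_sets = {"zh": 100, "en": 48, "sw": 36}
heads = {lang: rng.standard_normal((hid_dim, n)) * 0.1
         for lang, n in phone_sets.items()}

def acoustic_scores(frames, lang):
    """Frame-level phone posteriors for one language."""
    hidden = np.tanh(frames @ W_shared)    # shared across languages
    return softmax(hidden @ heads[lang])   # language-specific layer

frames = rng.standard_normal((5, feat_dim))  # 5 frames of fake features
post = acoustic_scores(frames, "sw")
print(post.shape)   # (5, 36): one posterior row per frame
```

Only the heads grow with the number of languages; the bulk of the parameters (the encoder) is amortized over all of them, which is what makes this scheme attractive for low-resource languages.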

Key words: Multilingual, Acoustic model, Speech recognition
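Point 2) of the abstract, injecting language classification information into the acoustic model, is often implemented by appending a one-hot language vector (or a learned language embedding) to every input frame. A minimal sketch of the one-hot variant follows; the language inventory and feature dimension are illustrative assumptions.

```python
import numpy as np

LANGS = ["zh", "en", "sw"]   # hypothetical language inventory

def add_language_info(frames, lang):
    """Append a one-hot language vector to every feature frame."""
    onehot = np.zeros(len(LANGS))
    onehot[LANGS.index(lang)] = 1.0
    # Tile the language vector across all frames and concatenate.
    return np.hstack([frames, np.tile(onehot, (frames.shape[0], 1))])

frames = np.zeros((4, 40))   # 4 frames of 40-dim features
x = add_language_info(frames, "en")
print(x.shape)       # (4, 43): acoustic features plus 3-dim language one-hot
print(x[0, 40:])     # [0. 1. 0.]
```

In practice the language label can come from metadata or from a language-identification front-end; the acoustic model then conditions on it without needing separate per-language networks.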


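Point 3), keyword search over frame-level alignments, depends on collapsing a CTC-style per-frame label sequence (merging repeats, dropping blanks) while remembering which frames each surviving token covered, so a keyword hit can be reported with time boundaries. The following self-contained sketch shows that collapse-and-search step; the labels, frame indices, and keyword are made up for illustration.

```python
# Collapse a CTC-style frame alignment into tokens with frame spans,
# then locate a keyword in the collapsed sequence.
BLANK = "_"

def collapse(frame_labels):
    """Merge adjacent repeated labels, drop blanks; keep (token, start, end)."""
    spans = []
    for t, lab in enumerate(frame_labels):
        if lab == BLANK:
            continue
        if spans and spans[-1][0] == lab and spans[-1][2] == t - 1:
            spans[-1] = (lab, spans[-1][1], t)   # extend the current run
        else:
            spans.append((lab, t, t))            # repeats split by blank stay separate
    return spans

def find_keyword(frame_labels, keyword):
    """Return (start_frame, end_frame) of the first keyword hit, or None."""
    spans = collapse(frame_labels)
    tokens = [s[0] for s in spans]
    for i in range(len(tokens) - len(keyword) + 1):
        if tokens[i:i + len(keyword)] == list(keyword):
            return spans[i][1], spans[i + len(keyword) - 1][2]
    return None

# Per-frame best labels from a hypothetical acoustic model.
frames = ["_", "h", "h", "_", "i", "i", "_", "u", "_", "_"]
hit = find_keyword(frames, "hi")
print(hit)   # (1, 5): the keyword spans frames 1 through 5
```

Because the frame indices survive the collapse, the detected keyword carries its start and end times for free, which is exactly what a keyword search system needs to report.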

CLC number: TP391
[1]HINTON G,DENG L,YU D,et al.Deep neural networks for acoustic modeling in speech recognition:the shared views of four research groups[J].IEEE Signal Processing Magazine,2012,29(6):82-97.
[2]POVEY D,PEDDINTI V,GALVEZ D,et al.Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]//Interspeech.2016:2751-2755.
[3]GRAVES A,FERNÁNDEZ S,GOMEZ F,et al.Connectionist temporal classification:labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning.2006:369-376.
[4]LIU C,ZHANG Q,ZHANG X,et al.Multilingual graphemic hybrid ASR with massive data augmentation[J].arXiv:1909.06522,2019.
[5]TONG S,GARNER P N,BOURLARD H.An investigation of multilingual ASR using end-to-end LF-MMI[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP 2019).IEEE,2019:6061-6065.
[6]TONG S,GARNER P N,BOURLARD H.Cross-lingual adaptation of a CTC-based multilingual acoustic model[J].Speech Communication,2018,104:39-46.
[7]TONG S,GARNER P N,BOURLARD H.Fast Language Adaptation Using Phonological Information[C]//INTERSPEECH.2018:2459-2463.
[8]HSU J Y,CHEN Y J,LEE H.Meta learning for end-to-end low-resource speech recognition[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2020:7844-7848.
[9]DALMIA S,SANABRIA R,METZE F,et al.Sequence-based multi-lingual low resource speech recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4909-4913.
[10]CHEN Y C,HSU J Y,LEE C K,et al.DARTS-ASR:Differentiable architecture search for multilingual speech recognition and adaptation[J].arXiv:2005.07029,2020.
[11]THOMAS S,AUDHKHASI K,KINGSBURY B.Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings[C]//INTERSPEECH.2020:4736-4740.
[12]GRAVES A.Sequence transduction with recurrent neural networks[J].arXiv:1211.3711,2012.
[13]CHAN W,JAITLY N,LE Q,et al.Listen,attend and spell:A neural network for large vocabulary conversational speech recognition[C]//2016 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2016:4960-4964.
[14]PRATAP V,SRIRAM A,TOMASELLO P,et al.Massively multilingual ASR:50 languages,1 model,1 billion parameters[J].arXiv:2007.03001,2020.
[15]LI B,PANG R,SAINATH T N,et al.Scaling end-to-end models for large-scale multilingual ASR[J].arXiv:2104.14830,2021.
[16]DATTA A,RAMABHADRAN B,EMOND J,et al.Language agnostic multilingual modeling[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2020:8239-8243.
[17]KARAFIÁT M,BASKAR M K,WATANABE S,et al.Analysis of multilingual sequence-to-sequence speech recognition systems[J].arXiv:1811.03451,2018.
[18]ADAMS O,WIESNER M,WATANABE S,et al.Massively multilingual adversarial speech recognition[J].arXiv:1904.02210,2019.
[19]CHO J,BASKAR M K,LI R,et al.Multilingual sequence-to-sequence speech recognition:architecture,transfer learning,and language modeling[C]//2018 IEEE Spoken Language Technology Workshop (SLT).IEEE,2018:521-527.
[20]ZHOU S,XU S,XU B.Multilingual end-to-end speech recognition with a single transformer on low-resource languages[J].arXiv:1806.05059,2018.
[21]LI B,ZHANG Y,SAINATH T,et al.Bytes are all you need:end-to-end multilingual speech recognition and synthesis with bytes[C]//ICASSP 2019-2019 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2019:5621-5625.
[22]HOU W,DONG Y,ZHUANG B,et al.Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning[C]//INTERSPEECH.2020:1037-1041.
[23]WATANABE S,HORI T,HERSHEY J R.Language independent end-to-end architecture for joint language identification and speech recognition[C]//2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2017:265-271.
[24]POVEY D,GHOSHAL A,BOULIANNE G,et al.The Kaldi speech recognition toolkit[C]//IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.IEEE Signal Processing Society,2011.
[25]CAI W,CAI Z,ZHANG X,et al.A novel learnable dictionary encoding layer for end-to-end language identification[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5189-5193.
[26]CAI W,CAI Z,LIU W,et al.Insights into end-to-end learning scheme for language identification[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5209-5213.
[27]MIAO X,MCLOUGHLIN I.LSTM-TDNN with convolutional front-end for dialect identification in the 2019 multi-genre broadcast challenge[J].arXiv:1912.09003,2019.
[28]MIAO X,MCLOUGHLIN I,YAN Y.A New Time-Frequency Attention Tensor Network for Language Identification[J].Circuits,Systems,and Signal Processing,2020,39(5):2744-2758.
[29]BEDYAKIN R,MIKHAYLOVSKIY N.Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions[J].arXiv:2106.00052,2021.
[30]TJANDRA A,CHOUDHURY D G,ZHANG F,et al.Improved language identification through cross-lingual self-supervised learning[J].arXiv:2107.04082,2021.
[31]KANNAN A,DATTA A,SAINATH T N,et al.Large-scale multilingual speech recognition with a streaming end-to-end model[C]//Proc.Interspeech 2019.2019:2130-2134.
[32]TOSHNIWAL S,SAINATH T N,WEISS R J,et al.Multilingual speech recognition with a single end-to-end model[C]//2018 IEEE international conference on acoustics,speech and signal processing (ICASSP).IEEE,2018:4904-4908.
[33]PUNJABI S,ARSIKERE H,RAEESY Z,et al.Streaming end-to-end bilingual asr systems with joint language identification[J].arXiv:2007.03900,2020.
[34]MÜLLER M,STÜKER S,WAIBEL A.Multilingual adaptation of RNN based ASR systems[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5219-5223.
[35]SEKI H,WATANABE S,HORI T,et al.An end-to-end language-tracking speech recognizer for mixed-language speech[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4919-4923.
[36]WATERS A,GAUR N,HAGHANI P,et al.Leveraging language id in multilingual end-to-end speech recognition[C]//2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2019:928-935.
[37]PUNJABI S,ARSIKERE H,RAEESY Z,et al.Joint ASR and language identification using RNN-T:An efficient approach to dynamic language switching[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2021:7218-7222.
[38]LIU D,WAN X,XU J,et al.Multilingual Speech Recognition Training and Adaptation with Language-Specific Gate Units[C]//2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP).IEEE,2018:86-90.
[39]LIU D,XU J,ZHANG P,et al.A unified system for multilingual speech recognition and language identification[J].Speech Communication,2021,127:17-28.
[40]LIU D,XU J,ZHANG P.End-to-End Multilingual Speech Recognition System with Language Supervision Training[J].IEICE Transactions on Information and Systems,2020,103(6):1427-1430.
[41]KIM S,SELTZER M L.Towards language-universal end-to-end speech recognition[C]//Proc.of the IEEE International Conference on Acoustics,Speech and Signal Processing.2018:4914-4918.
[42]YI J,TAO J,WEN Z,et al.Adversarial multilingual training for low-resource speech recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:4899-4903.
[43]YI J,TAO J,WEN Z,et al.Language-adversarial transfer learning for low-resource speech recognition[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2018,27(3):621-630.
[44]STOLCKE A.SRILM-an extensible language modeling toolkit[C]//Proc.of the International Conference on Spoken Language Processing.2002:901-904.
[45]WELLS J.SAMPA computer readable phonetic alphabet[M]//Handbook of Standards and Resources for Spoken Language Systems.Berlin and New York:Mouton de Gruyter,1997.
[46]HAMPSHIRE J B,WAIBEL A H.A novel objective function for improved phoneme recognition using time-delay neural networks[C]//Proc.of the International 1989 Joint Conference on Neural Networks.1989:235-241.
[47]WAIBEL A,HANAZAWA T,HINTON G,et al.Phoneme re-cognition using time-delay neural networks[J].IEEE Transactions on Acoustics,Speech,and Signal Processing,1989,37(3):328-339.
[48]HAMPSHIRE J B,WAIBEL A H.A novel objective functionfor improved phoneme recognition using time-delay neural networks[J].IEEE Transactions on Neural Networks,1990,1(2):216-228.
[49]CHOROWSKI J,BAHDANAU D,SERDYUK D,et al.Attention-based models for speech recognition[C]//Advances in Neural Information Processing Systems 28:Annual Conference on Neural Information Processing Systems 2015.2015:577-585.
[50]LI J,YE G,DAS A,et al.Advancing acoustic-to-word CTC model[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018:5794-5798.
[51]YUAN Y,LEUNG C C,XIE L,et al.Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection[C]//2017 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2017:5645-5649.
[52]RAM D,MICULICICH L,BOURLARD H.Multilingual bottleneck features for query by example spoken term detection[C]//2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2019:621-628.
[53]RAM D,MICULICICH L,BOURLARD H.Neural network based end-to-end query by example spoken term detection[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2020,28:1416-1427.
[54]WATANABE S,HORI T,KIM S,et al.Hybrid CTC/attention architecture for end-to-end speech recognition[J].IEEE Journal of Selected Topics in Signal Processing,2017,11(8):1240-1253.
[55]WATANABE S,HORI T,KARITA S,et al.ESPnet:end-to-end speech processing toolkit[C]//Interspeech.2018:2207-2211.
[56]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.
[57]GAGE P.A new algorithm for data compression[J].C Users Journal,1994,12(2):23-38.
[58]SENNRICH R,HADDOW B,BIRCH A.Neural machine translation of rare words with subword units[J].arXiv:1508.07909,2015.
Related articles:
[1] XU Ming-ke, ZHANG Fan.
Head Fusion:A Method to Improve Accuracy and Robustness of Speech Emotion Recognition
Computer Science, 2022, 49(7): 132-141. https://doi.org/10.11896/jsjkx.210100085
[2] LIU Jun-peng, SU Jin-song, HUANG De-gen.
Incorporating Language-specific Adapter into Multilingual Neural Machine Translation
Computer Science, 2022, 49(1): 17-23. https://doi.org/10.11896/jsjkx.210900005
[3] YU Dong, XIE Wan-ying, GU Shu-hao, FENG Yang.
Similarity-based Curriculum Learning for Multilingual Neural Machine Translation
Computer Science, 2022, 49(1): 24-30. https://doi.org/10.11896/jsjkx.210800254
[4] YANG Run-yan, CHENG Gao-feng, LIU Jian.
Study on Keyword Search Framework Based on End-to-End Automatic Speech Recognition
Computer Science, 2022, 49(1): 53-58. https://doi.org/10.11896/jsjkx.210800269
[5] LIU Chuang, XIONG De-yi.
Survey of Multilingual Question Answering
Computer Science, 2022, 49(1): 65-72. https://doi.org/10.11896/jsjkx.210900003
[6] ZHENG Chun-jun, WANG Chun-li, JIA Ning.
Survey of Acoustic Feature Extraction in Speech Tasks
Computer Science, 2020, 47(5): 110-119. https://doi.org/10.11896/jsjkx.190400122
[7] ZHANG Jing, YANG Jian, SU Peng.
Survey of Monosyllable Recognition in Speech Recognition
Computer Science, 2020, 47(11A): 172-174. https://doi.org/10.11896/jsjkx.200200006
[8] CUI Yang, LIU Chang-hong.
PIFA-based Evaluation Platform for Speech Recognition System
Computer Science, 2020, 47(11A): 638-641. https://doi.org/10.11896/jsjkx.200500097
[9] SHI Yan-yan, BAI Jing.
Speech Recognition Combining CFCC and Teager Energy Operators Cepstral Coefficients
Computer Science, 2019, 46(5): 286-289. https://doi.org/10.11896/j.issn.1002-137X.2019.05.044
[10] LONG Xing-yan, QU Dan, ZHANG Wen-lin.
Attention Based Acoustics Model Combining Bottleneck Feature
Computer Science, 2019, 46(1): 260-264. https://doi.org/10.11896/j.issn.1002-137X.2019.01.040
[11] ZHANG Ai-ying.
Research on Low-resource Mongolian Speech Recognition Based on Multilingual Speech Data Selection
Computer Science, 2018, 45(9): 308-313. https://doi.org/10.11896/j.issn.1002-137X.2018.09.052
[12] ZHANG Ai-ying, NI Chong-jia.
Research on Low-resource Mongolian Speech Recognition
Computer Science, 2017, 44(10): 318-322. https://doi.org/10.11896/j.issn.1002-137X.2017.10.057
[13] WEI Ying, WANG Shuang-wei, PAN Di, ZHANG Ling, XU Ting-fa, LIANG Shi-li.
Specific Two Words Chinese Lexical Recognition Based on Broadband and Narrowband Spectrogram Feature Fusion with Zoning Projection
Computer Science, 2016, 43(Z11): 215-219. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.049
[14] LI Wei-lin, WEN Jian, MA Wen-kai.
Speech Recognition System Based on Deep Neural Network
Computer Science, 2016, 43(Z11): 45-49. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.010
[15] SUN Zhi-yuan, LU Cheng-xiang, SHI Zhong-zhi, MA Gang.
Research and Advances on Deep Learning
Computer Science, 2016, 43(2): 1-8. https://doi.org/10.11896/j.issn.1002-137X.2016.02.001