Computer Science, 2023, Vol. 50, Issue (8): 111-117. doi: 10.11896/jsjkx.220600144
滕思航, 王烈, 李雅
TENG Sihang, WANG Lie, LI Ya
Abstract: Transformer models built on the self-attention mechanism have shown strong performance in speech recognition. Among them, the non-autoregressive Transformer automatic speech recognition (ASR) model decodes faster than its autoregressive counterpart, but the speed-up comes with a sharp drop in recognition accuracy. To improve the accuracy of non-autoregressive Transformer ASR, this paper first introduces frame-information merging based on Connectionist Temporal Classification (CTC), fusing the high-dimensional speech representation vectors within each frame span to remedy the incomplete feature information in the input sequence of the non-autoregressive Transformer decoder. Second, a phoneme-to-character feature conversion is applied to the model output: contextual information is fused into the pronunciation features produced by the decoder, which are then converted into outputs carrying richer character-level features, mitigating recognition errors among homophones (different characters sharing the same pronunciation). Experiments on the Chinese speech corpus AISHELL-1 show that the proposed model achieves a real time factor (RTF) of 0.0028 in recognition speed and a character error rate (CER) of 8.3% in recognition accuracy, making it highly competitive among mainstream Chinese speech recognition algorithms.
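To make the first mechanism concrete, below is a minimal sketch of CTC-guided frame merging, assuming a PyTorch encoder output and a frame-level CTC posterior. All names here (merge_frames_by_ctc, enc_out, blank_id) are illustrative assumptions for this sketch, not the paper's published implementation: consecutive frames that greedy CTC decoding assigns to the same non-blank token are fused into one vector, yielding a condensed input sequence for the non-autoregressive decoder.

```python
# A hedged sketch of CTC-based frame-information merging, assuming greedy
# (argmax) CTC alignment and mean-pooling as the fusion operator.
import torch

def merge_frames_by_ctc(enc_out: torch.Tensor,
                        ctc_logits: torch.Tensor,
                        blank_id: int = 0) -> torch.Tensor:
    """Fuse encoder frames whose CTC alignments belong to the same token.

    enc_out:    (T, D) high-dimensional speech representation vectors.
    ctc_logits: (T, V) frame-level logits from the CTC branch.
    Returns a (U, D) sequence, one fused vector per predicted token,
    serving as the non-autoregressive decoder input.
    """
    labels = ctc_logits.argmax(dim=-1)           # greedy frame-level alignment
    segments = []                                # (start, end) span per token
    prev, start = blank_id, None
    for t, lab in enumerate(labels.tolist()):
        if lab != blank_id and lab != prev:      # a new non-blank token begins
            if start is not None:
                segments.append((start, t))      # close the previous token span
            start = t
        elif lab == blank_id and start is not None:
            segments.append((start, t))          # token span ends at a blank
            start = None
        prev = lab
    if start is not None:
        segments.append((start, len(labels)))    # close a trailing span
    if not segments:                             # all-blank edge case
        return enc_out.new_zeros(0, enc_out.size(-1))
    # Fuse the encoder vectors inside each token span ("frame width").
    return torch.stack([enc_out[s:e].mean(dim=0) for s, e in segments])
```

Mean-pooling is one plausible choice of fusion within the frame span; the abstract alone does not specify the exact operator used in the paper. The second mechanism, phoneme-to-character conversion, can likewise be read as fusing context across the decoder's pronunciation features and projecting them to character logits. The sketch below assumes self-attention as the context-fusion operator; the module name, dimensions, and the AISHELL-1-style vocabulary size are all hypothetical.

```python
# A hedged sketch of the phoneme-to-character conversion step: self-attention
# mixes contextual information into the decoder's pronunciation features,
# and a linear head maps the fused features to character-level outputs.
import torch.nn as nn

class PhoneToCharConverter(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 vocab_size: int = 4233):  # assumed character vocabulary; illustrative
        super().__init__()
        self.context_attn = nn.MultiheadAttention(d_model, n_heads,
                                                  batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.to_char = nn.Linear(d_model, vocab_size)  # character logits

    def forward(self, pron_feats):                     # (B, U, d_model)
        ctx, _ = self.context_attn(pron_feats, pron_feats, pron_feats)
        fused = self.norm(pron_feats + ctx)            # residual context fusion
        return self.to_char(fused)                     # (B, U, vocab_size)
```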