Computer Science ›› 2023, Vol. 50 ›› Issue (8): 111-117. doi: 10.11896/jsjkx.220600144

• Computer Graphics & Multimedia •


Non-autoregressive Transformer Chinese Speech Recognition Incorporating Pronunciation-Character Representation Conversion

TENG Sihang, WANG Lie, LI Ya   

  1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
  • Received: 2022-06-16  Revised: 2023-04-23  Online: 2023-08-15  Published: 2023-08-02
  • Corresponding author: WANG Lie (lwang@gxu.edu.cn)
  • About author: TENG Sihang, born in 1996, postgraduate (2013391124@st.gxu.edu.cn). His main research interests include deep learning and speech recognition.
    WANG Lie, born in 1969, professor, master supervisor. His main research interests include deep learning, image processing and FPGA.
  • Supported by:
    Science and Technology Key Projects of Guangxi Province (AA21077007-1).


Abstract: The Transformer, built on the self-attention mechanism, shows powerful performance in speech recognition tasks, and the non-autoregressive Transformer automatic speech recognition model decodes faster than its autoregressive counterpart. However, this gain in recognition speed comes at the cost of a substantial drop in accuracy. To improve the accuracy of the non-autoregressive Transformer speech recognition model, frame information merging based on connectionist temporal classification (CTC) is first introduced: high-dimensional speech representations are fused within each frame-width range, which alleviates the incomplete feature information in the input sequences of the non-autoregressive Transformer decoder. Second, pronunciation-character representation conversion is applied to the model output: contextual information is fused into the pronunciation features produced by the decoder, which are then converted into an output carrying richer character-level features, thereby reducing recognition errors on homophones (different characters sharing the same pronunciation). Experiments on the Chinese speech dataset AISHELL-1 show that the proposed model achieves a recognition speed with a real time factor (RTF) of 0.0028 and a character error rate (CER) of 8.3%, demonstrating strong competitiveness among mainstream Chinese speech recognition algorithms.
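
The CTC-based frame information merging can be pictured with a short sketch. This page does not reproduce the paper's exact fusion rule, so the following PyTorch code is only a minimal illustration under stated assumptions: a greedy frame-level CTC alignment marks each token's span, and the encoder frames inside a span are mean-pooled into a single vector, yielding one fused representation per token as the non-autoregressive decoder's input. The function name ctc_frame_merge and the mean-pooling choice are illustrative, not the authors' implementation.

```python
import torch

def ctc_frame_merge(encoder_out, ctc_log_probs, blank_id=0):
    """Fuse frame-level encoder outputs into token-level vectors (illustrative).

    encoder_out:   (T, D) high-dimensional speech representations
    ctc_log_probs: (T, V) frame-level CTC log-probabilities
    Returns a (U, D) tensor with one fused vector per aligned token,
    usable as a length-U decoder input sequence.
    """
    frame_labels = ctc_log_probs.argmax(dim=-1).tolist()  # greedy CTC alignment

    merged, span, prev = [], [], blank_id
    for t, label in enumerate(frame_labels):
        if label != blank_id and label != prev:
            if span:                             # flush the previous token span
                merged.append(encoder_out[span].mean(dim=0))
            span = [t]                           # a new token starts here
        elif label != blank_id:                  # repeated label extends the span
            span.append(t)
        prev = label
    if span:
        merged.append(encoder_out[span].mean(dim=0))

    if not merged:                               # degenerate case: all blanks
        return encoder_out.new_zeros(0, encoder_out.size(1))
    return torch.stack(merged)

# Example: 100 frames of 256-dim encoder output, a 4233-entry CTC vocabulary.
enc = torch.randn(100, 256)
logp = torch.randn(100, 4233).log_softmax(dim=-1)
tokens = ctc_frame_merge(enc, logp)              # (U, 256) with U <= 100
```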
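The pronunciation-character representation conversion admits a similar sketch. How the paper fuses contextual information is not spelled out on this page, so the code below assumes a single self-attention layer over embedded pronunciation tokens followed by a character-level projection; the class name PronunciationToCharacter, the layer sizes, and the vocabulary sizes are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class PronunciationToCharacter(nn.Module):
    """Map a pronunciation (pinyin) sequence to character logits (illustrative).

    Self-attention fuses left and right context across the pronunciation
    features, giving the final projection the information it needs to pick
    the correct character among homophones.
    """
    def __init__(self, num_pinyin, num_chars, d_model=256, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(num_pinyin, d_model)
        self.context = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.to_char = nn.Linear(d_model, num_chars)

    def forward(self, pinyin_ids):
        x = self.embed(pinyin_ids)   # (B, U) ids -> (B, U, d_model)
        x = self.context(x)          # fuse contextual information in parallel
        return self.to_char(x)       # (B, U, num_chars) character logits

# Hypothetical sizes: ~1300 toned pinyin syllables, ~4230 Chinese characters.
converter = PronunciationToCharacter(num_pinyin=1300, num_chars=4230)
char_logits = converter(torch.randint(0, 1300, (1, 12)))  # one 12-token utterance
```

Because every position is converted in parallel, such a module would preserve the model's one-pass, non-autoregressive decoding, in line with the decoding-speed emphasis of the abstract.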

Key words: Speech recognition, Transformer, Non-autoregressive, Self-attention mechanism, Representation conversion

CLC Number: TP391