Computer Science ›› 2023, Vol. 50 ›› Issue (8): 111-117. doi: 10.11896/jsjkx.220600144

• Computer Graphics & Multimedia •

Non-autoregressive Transformer Chinese Speech Recognition Incorporating Pronunciation-Character Representation Conversion

TENG Sihang, WANG Lie, LI Ya   

  1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
  • Received: 2022-06-16 Revised: 2023-04-23 Online: 2023-08-15 Published: 2023-08-02
  • About author: TENG Sihang, born in 1996, postgraduate. His main research interests include deep learning and speech recognition.
    WANG Lie, born in 1969, professor, master supervisor. His main research interests include deep learning, image processing and FPGA.
  • Supported by:
    Science and Technology Key Projects of Guangxi Province (AA21077007-1).

Abstract: The Transformer, built on the self-attention mechanism, shows strong performance in speech recognition tasks, and the non-autoregressive Transformer automatic speech recognition model decodes considerably faster than its autoregressive counterpart. This gain in recognition speed, however, comes with a marked loss in accuracy. To improve the accuracy of the non-autoregressive Transformer speech recognition model, frame information merging based on connectionist temporal classification (CTC) is first introduced: high-dimensional speech representations are fused within a frame-width range, alleviating the incomplete feature information in the input sequences of the non-autoregressive Transformer decoder. Second, pronunciation-character representation conversion is applied to the model output: contextual information is fused into the pronunciation features produced by the decoder, converting the pronunciation representation into an output that carries richer character features and thereby reducing recognition errors on homophones, i.e., different characters sharing the same pronunciation. Experiments on the Chinese speech dataset AISHELL-1 show that the proposed model achieves a recognition speed of 0.0028 real time factor (RTF) and a recognition accuracy of 8.3% character error rate (CER), making it strongly competitive among mainstream Chinese speech recognition algorithms.
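The two mechanisms in the abstract can be illustrated with short PyTorch sketches. First, one plausible reading of CTC-based frame information merging: greedy CTC predictions locate trigger frames, and encoder states within a frame-width window around each trigger are averaged into one decoder-input vector per token. The function name, window logic, and parameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of CTC-based frame information merging (a hypothetical
# reading of the paper's description, not the authors' code).
import torch

def ctc_frame_merge(enc_out, ctc_logits, blank_id=0, frame_width=2):
    """enc_out: (T, D) encoder representations; ctc_logits: (T, V).
    Returns (U, D): one fused vector per CTC-detected token."""
    T, D = enc_out.shape
    preds = ctc_logits.argmax(dim=-1)            # greedy per-frame labels

    # Trigger frames: non-blank frames that start a new collapsed label
    # (standard CTC collapse of blanks and repeats).
    triggers, prev = [], blank_id
    for t in range(T):
        p = preds[t].item()
        if p != blank_id and p != prev:
            triggers.append(t)
        prev = p

    # Average encoder states in a +/- frame_width window around each
    # trigger so every decoder input carries local acoustic context.
    fused = []
    for t in triggers:
        lo, hi = max(0, t - frame_width), min(T, t + frame_width + 1)
        fused.append(enc_out[lo:hi].mean(dim=0))
    return torch.stack(fused) if fused else enc_out.new_zeros(0, D)
```

Second, the pronunciation-character representation conversion could take the form of a self-attention layer that fuses context across the decoder's per-token pronunciation features before projecting to the character vocabulary. Again a hedged sketch: the class name, layer sizes, and the 4233-character vocabulary (the usual AISHELL-1 figure) are assumptions.

```python
import torch.nn as nn

class PronToCharConverter(nn.Module):
    """Hypothetical sketch: self-attention fuses contextual information
    over pronunciation features; a linear head emits character logits."""
    def __init__(self, d_model=256, n_heads=4, char_vocab=4233):
        super().__init__()
        self.ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.to_char = nn.Linear(d_model, char_vocab)

    def forward(self, pron_feats):               # (B, U, d_model)
        ctx, _ = self.ctx(pron_feats, pron_feats, pron_feats)
        h = self.norm(pron_feats + ctx)          # residual context fusion
        return self.to_char(h)                   # (B, U, char_vocab)
```

In use, `ctc_frame_merge` would turn a (T, D) encoder output and its CTC logits into a (U, D) token-level sequence that the non-autoregressive decoder consumes in parallel, with `PronToCharConverter` applied to the decoder's output to produce the final character predictions.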

Key words: Speech recognition, Transformer, Non-autoregressive, Self-attention mechanism, Representation conversion

CLC Number: TP391