Computer Science ›› 2026, Vol. 53 ›› Issue (6A): 250600112-8. doi: 10.11896/jsjkx.250600112

• Image Processing & Multimedia Technology •

Accurate Recognition of Dialect Based on CTC-Conformer Model

SHEN Yingchun¹, FENG Xiaohan², LI Qian³

  1 Hanjiang National Laboratory, Wuhan 430064, China
  2 Hangzhou Zhiyuan Research Institute Co., Ltd., Hangzhou 310007, China
  3 China Mobile Group Hubei Co., Ltd., Wuhan 430024, China
  • Online: 2026-06-16  Published: 2026-06-12
  • About author: SHEN Yingchun, born in 1972, Ph.D., professor. His main research interests include command and control and computer system architecture.
    FENG Xiaohan, born in 1998, postgraduate, assistant engineer. Her main research interest is NLP.

Abstract: With the rapid development of speech recognition technology, dialect speech recognition has become important in a variety of application scenarios. To address phonetic variation, differences in speaking rate, and noise interference in dialect recognition, this paper proposes a dialect speech recognition method based on a CTC-Conformer model, aiming to improve recognition accuracy and robustness. The model combines the Conformer architecture with the connectionist temporal classification (CTC) mechanism. The encoder uses convolutional neural networks and multi-head self-attention to extract local features and long-range dependencies from the audio, which improves its modeling of dialect speech. The decoder adopts the CTC mechanism together with a dual-attention mechanism, reducing the need for explicit alignment and strengthening contextual modeling. A multi-task learning strategy balances the CTC loss against the cross-entropy loss, further improving recognition accuracy. Experimental results show that the proposed CTC-Conformer model achieves a character accuracy of 79.08% on standard test sets and remains stable in noisy environments, reaching 65.20% accuracy even under severe noise. The proposed CTC-Conformer model thus provides an efficient and robust solution for dialect speech recognition, with broad potential for real-world applications.
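
The abstract does not give the form of the multi-task objective or the decoding rule. As a minimal sketch, assuming the common joint CTC/attention formulation (a weighted sum of the two losses) and the standard CTC collapse rule (merge repeated labels, then drop blanks), the two ideas can be illustrated as follows. The function names, the blank index 0, and the default weight 0.3 are illustrative assumptions, not values reported by the paper.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Map per-frame label ids to an output sequence.

    Standard CTC rule: merge consecutive repeats, then remove blanks.
    This is why CTC needs no frame-level alignment in the training labels.
    """
    out: List[int] = []
    prev = None
    for idx in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out


def joint_loss(ctc_loss: float, ce_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate the CTC loss and the attention cross-entropy loss.

    ctc_weight is a hypothetical hyperparameter in [0, 1]; the paper's
    actual balance between the two losses is not stated in the abstract.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * ce_loss


if __name__ == "__main__":
    # Frames: blank, 3, 3, blank, 3, 5, 5, blank
    print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))
    print(joint_loss(ctc_loss=2.0, ce_loss=1.0, ctc_weight=0.5))
```

A blank between two identical labels keeps them as separate outputs, which is how CTC represents doubled characters. Setting `ctc_weight` to 0 or 1 recovers pure attention training or pure CTC training, respectively.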

Key words: Speech recognition, Dialect recognition, Attention mechanism, CTC-Conformer, Data augmentation

CLC Number: TP391