Computer Science ›› 2023, Vol. 50 ›› Issue (3): 282-290.doi: 10.11896/jsjkx.220100104

• Artificial Intelligence •

Chinese Spelling Check Based on BERT and Multi-feature Fusion Embedding

LIU Zhe1, YIN Chengfeng1, LI Tianrui1,2   

  1 School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China
    2 National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Chengdu 611756, China
  • Received: 2022-01-11  Revised: 2022-08-25  Online: 2023-03-15  Published: 2023-03-15
  • About author: LIU Zhe, born in 1998, postgraduate, is a member of China Computer Federation. His main research interests include Chinese spelling check, Chinese grammatical error correction and natural language processing.
    LI Tianrui, born in 1969, Ph.D, professor, Ph.D supervisor, is a distinguished member of China Computer Federation. His main research interests include big data intelligence, rough sets and granular computing.
  • Supported by:
    National Natural Science Foundation of China (61773324), Sichuan Key R&D Project (2020YFG0035) and Fundamental Research Funds for the Central Universities of Ministry of Education of China (2682021ZTPY097).

Abstract: Owing to the diversity of Chinese characters and the complexity of Chinese semantic expression, Chinese spelling checking remains an important and challenging task. Existing solutions usually fail to mine text semantics deeply enough and, when exploiting the distinctive similarity features of Chinese characters, often learn the mapping between incorrect and correct characters only through pre-established external resources or heuristic rules. This paper proposes BFMBERT (BiGRU-Fusion Mask BERT), an end-to-end Chinese spelling checking model that incorporates multi-feature embeddings of Chinese characters. The model first uses a pre-training task that combines confusion sets so that BERT learns knowledge about Chinese spelling errors. It then employs a bidirectional GRU network to estimate the error probability of each character in the text, and uses this probability to compute a fusion embedding that combines the semantic, pinyin, and glyph features of Chinese characters. Finally, the fusion embedding is fed into BERT's masked language model to predict the correct characters. Evaluated on the SIGHAN 2015 benchmark dataset, BFMBERT achieves an F1 score of 82.2, outperforming the baseline models.
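To make the described pipeline concrete, the following is a minimal PyTorch sketch of a detector-plus-fusion masked-LM architecture of the kind the abstract outlines. All module names, dimensions, the vocab-indexed pinyin/glyph embedding tables, and the exact probability-weighted fusion rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; hyperparameters and fusion details are assumptions.
import torch
import torch.nn as nn
from transformers import BertForMaskedLM

class FusionMaskBERTSketch(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", hidden=768):
        super().__init__()
        self.bert = BertForMaskedLM.from_pretrained(bert_name)       # masked-LM backbone
        self.sem_emb = self.bert.bert.embeddings.word_embeddings      # semantic embeddings
        vocab = self.bert.config.vocab_size
        # Hypothetical pinyin / glyph feature tables, indexed by token id for simplicity.
        self.pinyin_emb = nn.Embedding(vocab, hidden)
        self.glyph_emb = nn.Embedding(vocab, hidden)
        # BiGRU error detector producing a per-character error probability.
        self.detector = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.err_head = nn.Linear(hidden, 1)
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, input_ids, attention_mask):
        sem = self.sem_emb(input_ids)                                  # (B, L, H)
        det_out, _ = self.detector(sem)
        p_err = torch.sigmoid(self.err_head(det_out))                  # (B, L, 1)
        # Fuse semantic, pinyin and glyph features, weighted by the error probability
        # (assumed weighting scheme).
        multi = self.fuse(torch.cat(
            [sem, self.pinyin_emb(input_ids), self.glyph_emb(input_ids)], dim=-1))
        fused = p_err * multi + (1.0 - p_err) * sem
        # Feed the fused embedding into BERT's masked language model to predict characters.
        out = self.bert(inputs_embeds=fused, attention_mask=attention_mask)
        return out.logits, p_err.squeeze(-1)
```

Weighting the fusion by the detector's error probability lets characters judged correct keep their original semantic embedding, while suspicious positions lean more on the pinyin and glyph cues; this mirrors the soft-masking idea that the abstract's BiGRU detection stage suggests.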

Key words: Chinese spelling check, BERT, Text proofreading, Masked language model, Word error proofreading, Pre-training model

CLC Number: TP181