Computer Science ›› 2023, Vol. 50 ›› Issue (3): 282-290. doi: 10.11896/jsjkx.220100104

• Artificial Intelligence •

  • Corresponding author: LI Tianrui (trli@swjtu.edu.cn)
  • First author: LIU Zhe (liuzhe@my.swjtu.edu.cn)

Chinese Spelling Check Based on BERT and Multi-feature Fusion Embedding

LIU Zhe1, YIN Chengfeng1, LI Tianrui1,2   

  1 School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China
  2 National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Chengdu 611756, China
  • Received: 2022-01-11 Revised: 2022-08-25 Online: 2023-03-15 Published: 2023-03-15
  • About author: LIU Zhe, born in 1998, postgraduate, is a member of China Computer Federation. His main research interests include Chinese spelling check, Chinese grammatical error correction and natural language processing.
    LI Tianrui, born in 1969, Ph.D, professor, Ph.D supervisor, is a distinguished member of China Computer Federation. His main research interests include big data intelligence, rough sets and granular computing.
  • Supported by:
    National Natural Science Foundation of China (61773324), Sichuan Key R&D Project (2020YFG0035) and the Fundamental Research Funds for the Central Universities of the Ministry of Education of China (2682021ZTPY097).


Abstract: Due to the diversity of Chinese characters and the complexity of Chinese semantic expression, Chinese spelling checking remains an important and challenging task. Existing solutions usually fail to dig deeply into text semantics, and when exploiting the unique similarity features of Chinese characters they often learn the mapping between incorrect and correct characters through pre-established external resources or heuristic rules. This paper proposes BFMBERT (BiGRU-Fusion Mask BERT), an end-to-end Chinese spelling check model that incorporates multi-feature embeddings of Chinese characters. The model first uses a pre-training task built on a confusion set to make BERT learn knowledge of Chinese spelling errors. It then employs a bidirectional GRU network to estimate the probability that each character in the text is erroneous, and uses this probability to compute a fusion embedding of the semantic, pinyin and glyph features of each character. Finally, it feeds this fusion embedding into BERT's masked language model (MLM) head to predict the correct characters. Evaluated on the SIGHAN 2015 benchmark dataset, BFMBERT achieves an F1 score of 82.2, outperforming the baseline models.
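The probability-weighted fusion described in the abstract can be sketched in a few lines. The weighting scheme, function names and toy vectors below are illustrative assumptions only, not the paper's actual implementation (which computes the fusion inside the network from learned BERT, pinyin and glyph encoders); the sketch only shows how a detector probability can interpolate between a character's semantic embedding and its similarity (pinyin/glyph) features.

```python
import math

def sigmoid(x):
    """Squash a raw detector score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse_embeddings(semantic, pinyin, glyph, error_prob):
    """Blend one character's three feature vectors by its error
    probability: a likely-wrong character leans on its pinyin and
    glyph (similarity) features, while a likely-correct one keeps
    mostly its semantic embedding."""
    return [
        (1.0 - error_prob) * s + error_prob * 0.5 * (p + g)
        for s, p, g in zip(semantic, pinyin, glyph)
    ]

# Toy 2-dimensional embeddings for a single character.
semantic = [1.0, 2.0]
pinyin = [0.0, 0.0]
glyph = [4.0, 8.0]

print(fuse_embeddings(semantic, pinyin, glyph, 0.0))  # → [1.0, 2.0], semantic kept as-is
print(fuse_embeddings(semantic, pinyin, glyph, 1.0))  # → [2.0, 4.0], pure pinyin/glyph average
```

In the paper the error probability itself comes from the BiGRU detector over BERT representations; a `sigmoid` over the detector's score, as above, is the standard way to obtain such a per-character probability.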

Key words: Chinese spelling check, BERT, Text proofreading, Masked language model, Word error proofreading, Pre-training model

CLC number: TP181