Chinese Real-word Error Automatic Proofreading Based on Combining of Local Context Features

doi:10.11896/j.issn.1002-137X.2016.12.005

Abstract

Abstract: A real-word error in Chinese, like its English counterpart, occurs when a Chinese word is mistakenly written as another word that is itself in the dictionary. This paper proposes a confusion-set-based method for detecting and correcting such errors. The method extracts local features around the target word, forming a left-adjacent bigram, a right-adjacent bigram, and three trigrams, and estimates the bigram and trigram probabilities using the confusion words in the target word's confusion set. A multi-feature fusion model is then proposed, and rules are applied to identify real-word errors in Chinese text. The detection results are divided into two types: marking an error and rewriting an error. The experiments use 18 confusion sets and a constructed test corpus of 20,000 lines. The results show that the method can effectively find real-word errors in Chinese text and offer correction suggestions. The method thus integrates automatic error detection and automatic error correction in a single approach to Chinese text proofreading.

Key words: Real-word error, Confusion set, Context feature, N-gram model
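The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. Only the feature layout (two adjacent bigrams and three trigrams around the target word) and the two result types (mark, rewrite) come from the abstract. Everything else is an assumption made for illustration: the function names, the normalization of each feature's count over the confusion set, the fusion weights `w_bi` and `w_tri`, the `margin` decision rule, and the toy corpus built around the confusion set 权利/权力.

```python
from collections import Counter

PAD_L, PAD_R = ["<s>", "<s>"], ["</s>", "</s>"]

def ngram_counts(sentences):
    """Count bigrams and trigrams over pre-segmented, boundary-padded sentences."""
    bi, tri = Counter(), Counter()
    for tokens in sentences:
        p = PAD_L + list(tokens) + PAD_R
        bi.update(zip(p, p[1:]))
        tri.update(zip(p, p[1:], p[2:]))
    return bi, tri

def local_features(tokens, i, word):
    """Left/right adjacent bigrams and the three trigrams with `word` at position i."""
    p = PAD_L + list(tokens) + PAD_R
    j = i + 2  # index of the target word after left padding
    l2, l1, r1, r2 = p[j - 2], p[j - 1], p[j + 1], p[j + 2]
    bigrams = [(l1, word), (word, r1)]
    trigrams = [(l2, l1, word), (l1, word, r1), (word, r1, r2)]
    return bigrams, trigrams

def candidate_scores(tokens, i, confusion_set, bi, tri, w_bi=1.0, w_tri=2.0):
    """Assumed fusion: weighted sum of per-feature relative frequencies,
    each normalized over the words of the confusion set."""
    feats = {c: local_features(tokens, i, c) for c in confusion_set}
    scores = {c: 0.0 for c in confusion_set}
    for kind, counts, n, weight in ((0, bi, 2, w_bi), (1, tri, 3, w_tri)):
        for k in range(n):
            total = sum(counts[feats[c][kind][k]] for c in confusion_set)
            if total == 0:
                continue  # feature unseen for every candidate: no evidence
            for c in confusion_set:
                scores[c] += weight * counts[feats[c][kind][k]] / total
    return scores

def check(tokens, i, confusion_set, bi, tri, margin=0.5):
    """Assumed rule: rewrite when a confusion word beats the original by
    `margin`, mark when it wins by less, otherwise accept the original."""
    scores = candidate_scores(tokens, i, confusion_set, bi, tri)
    original = tokens[i]
    best = max(confusion_set, key=lambda c: scores[c])
    if best == original or scores[best] == scores[original]:
        return ("ok", original)
    if scores[best] - scores[original] >= margin:
        return ("rewrite", best)
    return ("mark", original)

# Hypothetical toy training corpus, already word-segmented.
corpus = [
    ["公民", "享有", "合法", "权利"],
    ["公民", "的", "合法", "权利", "受", "保护"],
    ["政府", "行使", "行政", "权力"],
    ["行政", "权力", "受", "监督"],
]
bi, tri = ngram_counts(corpus)
confusion = ["权利", "权力"]

print(check(["保护", "公民", "的", "合法", "权力"], 4, confusion, bi, tri))
print(check(["政府", "行使", "行政", "权力"], 3, confusion, bi, tri))
```

On this toy data the first call proposes replacing 权力 with 权利, and the second accepts 权力 as written. A real system would need smoothing for sparse counts and weights tuned on held-out data, neither of which the abstract specifies.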

LIU Liang-liang and CAO Cun-gen. Chinese Real-word Error Automatic Proofreading Based on Combining of Local Context Features[J]. Computer Science, 2016, 43(12): 30-35. https://doi.org/10.11896/j.issn.1002-137X.2016.12.005
