Chinese Real-word Error Automatic Proofreading Based on Combining of Local Context Features

doi:10.11896/j.issn.1002-137X.2016.12.005

Abstract

Similar to context-sensitive spelling correction in English, a real-word error in Chinese is an error in which one valid Chinese word is mistakenly used in place of another. This paper proposes a method for detecting and correcting Chinese real-word errors based on confusion sets. The method extracts local features around the target word: the left-adjacent bigram, the right-adjacent bigram, and three trigrams. The probabilities of these bigrams and trigrams are computed for each confusion word in the target word's confusion set. A model based on multi-feature fusion is proposed, and rules are used to find the real-word errors. The results are classified into two types: marking the errors and rewriting the errors. In the experiment, 18 groups of confusion sets and a corpus of 20,000 sentences were used to validate the algorithm. The results show that the proposed method can find real-word errors in Chinese texts and provide lists of corrections. The method combines automatic error detection with automatic error correction.

Key words: Real-word error, Confusion set, Context feature, N-gram model
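
The local features named in the abstract (one left-adjacent bigram, one right-adjacent bigram, and three trigrams around the target word) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the toy segmented corpus, the relative-frequency estimate without smoothing, and the uniform-weight linear sum used as the "fusion" step. The abstract does not specify the paper's actual fusion model or its detection rules.

```python
from collections import Counter

PAD_L, PAD_R = "<s>", "</s>"

def ngram_counts(sentences, n):
    """Count n-grams over segmented sentences, padding two tokens per side."""
    counts = Counter()
    for tokens in sentences:
        p = [PAD_L] * 2 + list(tokens) + [PAD_R] * 2
        for k in range(len(p) - n + 1):
            counts[tuple(p[k:k + n])] += 1
    return counts

def local_features(tokens, i, candidate):
    """Return the five local n-grams around position i, with `candidate`
    substituted for the target word: left bigram, right bigram, three trigrams."""
    p = [PAD_L] * 2 + list(tokens) + [PAD_R] * 2
    j = i + 2  # index of the target word in the padded sequence
    return [
        (p[j - 1], candidate),                # left-adjacent bigram
        (candidate, p[j + 1]),                # right-adjacent bigram
        (p[j - 2], p[j - 1], candidate),      # trigram ending at the target
        (p[j - 1], candidate, p[j + 1]),      # trigram centered on the target
        (candidate, p[j + 1], p[j + 2]),      # trigram starting at the target
    ]

def rank_candidates(tokens, i, confusion_set, bigrams, trigrams,
                    weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Score each confusion word by a weighted sum of relative n-gram
    frequencies (an assumed stand-in for the paper's fusion model)."""
    bi_total = sum(bigrams.values()) or 1
    tri_total = sum(trigrams.values()) or 1
    scored = []
    for cand in confusion_set:
        s = 0.0
        for w, gram in zip(weights, local_features(tokens, i, cand)):
            if len(gram) == 2:
                s += w * bigrams[gram] / bi_total
            else:
                s += w * trigrams[gram] / tri_total
        scored.append((cand, s))
    # Highest score first; ties broken by the word itself for determinism.
    return sorted(scored, key=lambda cs: (-cs[1], cs[0]))

# Toy segmented corpus (hypothetical): 反映 "to report" vs. 反应 "reaction".
corpus = [
    ["他", "向", "领导", "反映", "问题"],
    ["群众", "反映", "问题"],
    ["化学", "反应", "很", "快"],
    ["他", "的", "反应", "很", "快"],
]
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

confusion_set = {"反映", "反应"}
sentence = ["他", "向", "领导", "反应", "问题"]  # 反应 is misused here
ranked = rank_candidates(sentence, 3, confusion_set, bigrams, trigrams)

# A simple assumed rule: flag the word if a different candidate ranks first.
is_error = ranked[0][0] != sentence[3]
```

On this toy corpus, none of the five n-grams built with 反应 occurs, so 反映 ranks first and the target word is flagged. A realistic system would need smoothed probabilities estimated from a large corpus, tuned feature weights, and a threshold or rule set to decide between merely marking an error and rewriting it.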

LIU Liang-liang and CAO Cun-gen. Chinese Real-word Error Automatic Proofreading Based on Combining of Local Context Features[J]. Computer Science, 2016, 43(12): 30-35.
