Research on Method of Constructing Chinese Character Confusion Set

doi:10.11896/j.issn.1002-137X.2014.08.049

Computer Science, 2014, Vol. 41, Issue (8): 229-232. doi: 10.11896/j.issn.1002-137X.2014.08.049

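SHI Heng-li, LIU Liang-liang, WANG Shi, FU Jian-hui, ZHANG Zai-yue and CAO Cun-gen

School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003; School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003; Graduate School, University of Chinese Academy of Sciences, Beijing 100049; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

Online: 2018-11-14 Published: 2018-11-14
Supported by:
This work was supported by the Key Program of the National Natural Science Foundation of China (91224006, 61173063, 61035004) and the General Program of the National Natural Science Foundation of China (61203284).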

Abstract

Abstract: A Chinese character confusion set is one of the important resources for identifying wrongly written characters. In this study, we first manually compiled the possible wrongly written forms of 11935 Chinese characters. Taking these characters as nodes and the "can be miswritten as" relation as edges, we then organized the confusion set into a confusion set graph of wrongly written characters. Because manually compiling wrongly written characters is severely limited, we designed, on the basis of this seed confusion set graph, a self-expansion algorithm and an external supplementing algorithm that draws on open source data from a large corpus, in order to expand the graph and discover new pairs of wrongly written characters. In the experiment, 15133 new pairs of wrongly written characters were found. Proofreading of a random sample showed an accuracy of 87.35%.

Key words: Wrongly written character confusion set, Self-expansion, Open source data, Rule-based and statistics-based
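The graph representation described in the abstract can be illustrated with a minimal Python sketch. It assumes the confusion set is stored as a directed adjacency map in which an edge from one character to another means that the first can be miswritten as the second. The seed pairs, the function names `build_confusion_graph` and `two_hop_candidates`, and the two-hop heuristic are all hypothetical and chosen only for illustration; the abstract does not specify how the self-expansion algorithm works, so this is not the authors' method.

```python
from collections import defaultdict


def build_confusion_graph(pairs):
    """Build a directed graph; an edge a -> b means 'a can be miswritten as b'."""
    graph = defaultdict(set)
    for correct, wrong in pairs:
        graph[correct].add(wrong)
    return graph


def two_hop_candidates(graph):
    """Hypothetical expansion heuristic (an assumption, not the paper's algorithm):
    propose a new pair (a, c) whenever a -> b and b -> c already exist."""
    candidates = set()
    for a, neighbours in list(graph.items()):
        for b in neighbours:
            # .get() avoids creating empty entries in the defaultdict
            for c in graph.get(b, ()):
                if c != a and c not in neighbours:
                    candidates.add((a, c))
    return candidates


# Illustrative seed pairs only; these are not taken from the paper's data.
seed_pairs = [("已", "己"), ("己", "巳"), ("未", "末")]
graph = build_confusion_graph(seed_pairs)
candidates = two_hop_candidates(graph)
print(sorted(candidates))
```

Any candidate pairs produced this way would still need checking, in the same spirit as the random-sample proofreading reported in the abstract.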

SHI Heng-li, LIU Liang-liang, WANG Shi, FU Jian-hui, ZHANG Zai-yue and CAO Cun-gen. Research on Method of Constructing Chinese Character Confusion Set[J]. Computer Science, 2014, 41(8): 229-232. https://doi.org/10.11896/j.issn.1002-137X.2014.08.049
