Computer Science, 2014, Vol. 41, Issue (11): 256-259. doi: 10.11896/j.issn.1002-137X.2014.11.049

• Artificial Intelligence •

Research on an Automatic Word-Building Algorithm Based on Association Rules

WANG Jian-quan, JI Shao-bo

  1. Faculty of Management and Economics, Dalian University of Technology, Dalian 116023
  • Publication date: 2018-11-14  Release date: 2018-11-14

Research and Application on Auto-word Building

WANG Jian-quan and JI Shao-bo   

  • Online: 2018-11-14  Published: 2018-11-14

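Abstract: Words are the basic elements of Chinese text, and Chinese language models play a key role in Chinese text mining. Chinese text mining is a high-dimensional data processing technique, and mining algorithms are sensitive to dimensionality, so mining quality depends on the quality of the lexicon. In addition, existing Chinese language models are generally statistical; the N-gram language model and its various improved variants, for example, all have high computational complexity. To reduce the computational complexity of the language model and to improve lexicon quality and word-building efficiency, this paper defines Chinese words in terms of association rule theory and, on that basis, constructs the Auto-word automatic word-building algorithm. The algorithm can dynamically construct a vocabulary from large Chinese corpora, which then serves as the basis for Chinese text mining. Finally, experiments demonstrate the effectiveness of the proposed algorithm.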

Keywords: automatic word building, statistical language model, association rules, longest common subsequence, text classification

Abstract: Words are the basic elements of Chinese text, and the Chinese language model plays a key role in Chinese text mining. Text classification is a high-dimensional data mining technology, and most classification algorithms are sensitive to the number of dimensions. As a result, classification performance depends on the quality of the vocabulary. Moreover, most current Chinese language models are statistical, such as the N-gram model and its improved variants. However, these statistical models suffer from high computational complexity. To improve vocabulary quality and word-building efficiency, this paper gives Chinese words a new definition based on association rules and proposes the Auto-word algorithm, by which a vocabulary is constructed automatically and used for Chinese text mining. Finally, experiments demonstrate the effectiveness of the Auto-word algorithm.

Key words: Constructing words automatically, Statistical language model, Association rules, Longest common subsequence, Text classification
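
The abstract does not spell out how association rules are used to define a word, so the following is only a minimal sketch under an assumed reading, not the paper's Auto-word algorithm. It treats each adjacent character pair "ab" as a rule a -> b, with support equal to the pair's count and confidence equal to count(ab) / count(a), and keeps the pairs that pass both thresholds. The function name `mine_candidate_words`, the threshold values, and the toy corpus are all hypothetical.

```python
from collections import Counter


def mine_candidate_words(sentences, min_support=2, min_confidence=0.6):
    """Return adjacent character pairs that pass support/confidence thresholds.

    Assumption for illustration: a pair "ab" is read as the rule a -> b,
    support = count(ab), confidence = count(ab) / count(a).
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count single characters and every adjacent character pair.
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    words = {}
    for pair, support in pair_counts.items():
        confidence = support / char_counts[pair[0]]
        if support >= min_support and confidence >= min_confidence:
            words[pair] = (support, confidence)
    return words


# Toy corpus (hypothetical): "data mining technology",
# "text mining algorithm", "data processing technology".
corpus = ["数据挖掘技术", "文本挖掘算法", "数据处理技术"]
vocabulary = mine_candidate_words(corpus)
for word, (support, confidence) in sorted(vocabulary.items()):
    print(word, support, round(confidence, 2))
```

On this toy corpus only the pairs that occur twice (数据 "data", 挖掘 "mining", 技术 "technology") survive the support threshold. A real system would presumably extend candidates beyond two characters and tune both thresholds on a large corpus.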

