Computer Science, 2014, Vol. 41, Issue (11): 256-259. doi: 10.11896/j.issn.1002-137X.2014.11.049


Research and Application on Auto-word Building

WANG Jian-quan and JI Shao-bo   

  • Online: 2018-11-14  Published: 2018-11-14

Abstract: Words are the basic elements of Chinese text, and the Chinese language model plays a key role in Chinese text mining. Text classification is a high-dimensional data mining task, and most classification algorithms are sensitive to the number of dimensions. As a result, classification depends on the size of the vocabulary. In addition, most current Chinese language models are based on statistical theory, such as the N-gram model and its improved variants. However, these statistical models suffer from high computational complexity. To improve the quantity and efficiency, this paper gives Chinese words a new definition based on association rules and proposes the Auto-word algorithm, which constructs a word vocabulary automatically for use in Chinese text mining. Finally, experiments demonstrate the efficiency of the Auto-word algorithm.

Key words: Constructing words automatically, Statistical language model, Association rules, Longest common subsequence, Text classification
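The abstract does not spell out the Auto-word algorithm, so the following is only a minimal Python sketch of the general idea of defining words through association rules, not the authors' method. It assumes (hypothetically) that a candidate word is an adjacent character pair, that support is the fraction of sentences containing the pair, and that confidence is the fraction of sentences containing the first character that also contain the pair. The function name `candidate_words`, the thresholds, and the sample sentences are all illustrative assumptions.

```python
from collections import Counter


def candidate_words(sentences, min_support=0.5, min_confidence=0.8):
    """Return adjacent character pairs that pass support and confidence thresholds.

    Illustrative only: this is an association-rule style filter,
    not the Auto-word algorithm described in the paper.
    """
    n = len(sentences)
    if n == 0:
        return {}

    char_count = Counter()  # number of sentences containing each character
    pair_count = Counter()  # number of sentences containing each adjacent pair
    for s in sentences:
        char_count.update(set(s))
        pair_count.update({s[i:i + 2] for i in range(len(s) - 1)})

    words = {}
    for pair, count in pair_count.items():
        support = count / n                        # how common the pair is overall
        confidence = count / char_count[pair[0]]   # rule: first char => pair
        if support >= min_support and confidence >= min_confidence:
            words[pair] = (support, confidence)
    return words


# Hypothetical sample sentences, chosen only to exercise the function.
sentences = ["数据挖掘技术", "文本数据挖掘", "数据分类算法", "挖掘关联规则"]
print(candidate_words(sentences))
```

With these sample sentences, only "数据" and "挖掘" pass both thresholds; "据挖" reaches the support threshold but fails on confidence, which is the kind of boundary-straddling pair such a filter is meant to reject.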

