Mutual Information Distribution of Frequent N-gram Chinese Characters

doi:10.11896/j.issn.1002-137X.2014.10.058

Abstract

Abstract: Chinese word segmentation and new-term extraction based on mutual information have been typical statistics-based Chinese information processing techniques over the past 20 years. This paper examines the mutual information distribution of frequent 2-gram, 3-gram and 4-gram Chinese character strings in a large corpus. The statistics yield two clear findings. First, there is no evident mutual information boundary between Chinese words and phrases, which means that words and phrases cannot be distinguished by either mutual information or frequency. Second, the mutual information values of words, phrases and illegal Chinese strings are mixed together, which dramatically affects the precision of statistics-based Chinese information processing. Together, these findings show that Chinese word extraction and segmentation based on statistical techniques alone still face great challenges.
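As a rough illustration of the quantity discussed in the abstract, the sketch below computes mutual information for adjacent character pairs (2-grams). It assumes the common pointwise log-ratio form, log2(P(xy) / (P(x)P(y))), with probabilities estimated from raw counts; the abstract does not state the paper's exact formula or how it is extended to 3-grams and 4-grams, so the function name `bigram_pmi` and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(text, min_count=2):
    """Return {bigram: PMI} for adjacent character pairs seen at least min_count times.

    Assumption: PMI = log2(P(xy) / (P(x) * P(y))), estimated from raw counts.
    This is an illustrative definition, not necessarily the one used in the paper.
    """
    # Drop whitespace so pairs are formed only between visible characters.
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)
    bigrams = Counter(a + b for a, b in zip(chars, chars[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for gram, count in bigrams.items():
        if count < min_count:
            continue  # keep only "frequent" strings
        p_xy = count / n_bi
        p_x = unigrams[gram[0]] / n_uni
        p_y = unigrams[gram[1]] / n_uni
        scores[gram] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Toy input only; a real study would use a large corpus.
    for gram, score in sorted(bigram_pmi("abababab").items()):
        print(gram, round(score, 3))
```

Under this definition, a word, a phrase and a non-word string can receive similar scores, which is the kind of overlap the abstract reports.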

YU Yi-jiao, YIN Yan-fei and LIU Qin. Mutual Information Distribution of Frequent N-gram Chinese Characters[J]. Computer Science, 2014, 41(10): 276-282.


URL: https://www.jsjkx.com/EN/10.11896/j.issn.1002-137X.2014.10.058

https://www.jsjkx.com/EN/Y2014/V41/I10/276

