Computer Science (计算机科学) ›› 2014, Vol. 41 ›› Issue (10): 276-282. doi: 10.11896/j.issn.1002-137X.2014.10.058

• Artificial Intelligence •

Analysis of the Mutual Information Distribution of Frequent Chinese Character Strings Based on a Large-Scale Corpus

YU Yi-jiao, YIN Yan-fei, LIU Qin

  1. Department of Linguistics, Central China Normal University, Wuhan 430079, China; School of Computer Science, Wuhan University, Wuhan 430072, China
  • Online: 2018-11-14    Published: 2018-11-14
  • Supported by:
    This work was supported by the Humanities and Social Sciences Research Project of the Ministry of Education, "Research on Semantic Retrieval Technology for Chinese Web Pages Integrating Logical Reasoning and Word-Sense Matching" (10YJA740120), and the Humanities and Social Sciences Research Project of the Hubei Provincial Department of Education, "Research on Retrieval Methods for Chinese Web Pages Based on Semantic Understanding" (2010b032).

Mutual Information Distribution of Frequent N-gram Chinese Characters

YU Yi-jiao, YIN Yan-fei and LIU Qin


Abstract: Dictionary construction and automatic word segmentation based on mutual information are typical statistics-based Chinese information processing techniques. By computing the mutual information of frequent two-character, three-character, and four-character strings in a large-scale Chinese text corpus, we find the following. First, the mutual information of high-frequency words is not particularly high, and there is no clear boundary between the mutual information distributions of words and phrases. Second, there is likewise no clear boundary between the mutual information of high-frequency invalid character strings and that of words and phrases. Because the mutual information values of words, phrases, and invalid strings are intermixed, it is very difficult to label words, phrases, and invalid strings automatically and efficiently using only the mutual information or frequency of a character string. These findings indicate that efficient Chinese dictionary construction and automatic word segmentation based purely on statistics over large-scale real-text corpora face great challenges.
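The abstract does not spell out the measure itself; a common choice, assumed here for illustration rather than taken from the paper, is the pointwise mutual information of a two-character string xy,

\mathrm{MI}(xy) = \log_2 \frac{p(xy)}{p(x)\,p(y)},

where p(xy) is the relative frequency of the string in the corpus and p(x), p(y) are the relative frequencies of its component characters; analogous ratios can be formed for three- and four-character strings.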

Abstract: Mutual-information-based Chinese word segmentation and new-term extraction have been typical statistics-based Chinese information processing technologies for the past 20 years. This paper examines the mutual information distribution of frequent 2-gram, 3-gram, and 4-gram Chinese character strings in a large corpus. The statistics reveal two clear findings. First, there is no evident mutual information boundary between Chinese words and phrases, which means that words and phrases cannot be reliably distinguished by either mutual information or frequency. Second, the mutual information values of words, phrases, and invalid Chinese character strings are mixed together, which dramatically reduces the precision of statistics-based Chinese information processing. These two findings show that Chinese word extraction and segmentation based purely on statistical techniques still face great challenges.
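As a concrete illustration of the measurement described above, the following Python sketch computes pointwise mutual information for frequent two-character strings from a raw corpus. It is not the authors' code; the corpus file name, the frequency threshold, and the restriction to 2-grams are illustrative assumptions.

import math
from collections import Counter

def char_bigram_pmi(text, min_count=50):
    """Return {bigram: PMI} for character bigrams seen at least min_count times."""
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)
    bigrams = Counter(a + b for a, b in zip(chars, chars[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for s, c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / n_bi                  # relative frequency of the 2-gram
        p_x = unigrams[s[0]] / n_uni     # relative frequency of the first character
        p_y = unigrams[s[1]] / n_uni     # relative frequency of the second character
        # MI(xy) = log2( p(xy) / (p(x) * p(y)) )
        pmi[s] = math.log2(p_xy / (p_x * p_y))
    return pmi

if __name__ == "__main__":
    # "corpus.txt" stands in for a large-scale Chinese text corpus (an assumption).
    with open("corpus.txt", encoding="utf-8") as f:
        scores = char_bigram_pmi(f.read())
    # Frequent strings with high PMI are word candidates, but as the paper reports,
    # words, phrases and invalid strings overlap heavily in this ranking.
    for s, v in sorted(scores.items(), key=lambda kv: -kv[1])[:20]:
        print(s, round(v, 3))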

