Computer Science, 2014, Vol. 41, Issue (4): 263-268.

• Artificial Intelligence •

N-gram String Counting for Very Large-Scale Chinese Text

YU Yi-Jiao, LIU Qin

  1. Department of Linguistics, Central China Normal University, Wuhan 430079; School of Computer Science, Wuhan University, Wuhan 430072
  • Publication date: 2018-11-14  Release date: 2018-11-14
  • Supported by:
    This work was supported by the Humanities and Social Sciences Research Project of the Ministry of Education, "Research on Semantic Retrieval Technology for Chinese Web Pages Combining Logical Reasoning and Word-Sense Matching" (10YJA740120), and by the Humanities and Social Sciences Research Project of the Hubei Provincial Department of Education, "Research on Semantic-Understanding-Based Retrieval Methods for Chinese Web Pages" (2010b032)

N-gram Chinese Characters Counting for Huge Text Corpora

YU Yi-Jiao and LIU Qin   

  • Online: 2018-11-14  Published: 2018-11-14

Abstract: The Chinese text statistics tool Cici efficiently counts and retrieves the frequencies of N-gram strings in very large-scale Chinese text corpora. Counting Chinese corpora of different sizes shows that the number of distinct N-gram character strings in a corpus is largest when N equals 6. The total number of N-gram strings in a corpus can be estimated accurately from the average length and the number of its "sentences". Because most character strings occur fewer than 10 times in a corpus, a segmented scheme for storing and sorting string frequencies is proposed: strings whose frequency does not exceed 10 are stored separately, while strings with frequency above 10 are sorted and stored in segments. A large-scale Chinese corpus should first be counted block by block and the per-block results then merged; a block size of about 20MB is recommended.
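For the estimate mentioned above: a sentence of m characters contains m - N + 1 N-grams of length N, so a corpus with S sentences of average length L holds roughly S * (L - N + 1) N-gram occurrences. The following is a minimal sketch of the frequency-segmented storage idea, assuming an in-memory Python Counter and a hypothetical one-file-per-frequency layout for the low-frequency strings; it is only an illustration, not Cici's actual implementation or on-disk format.

from collections import Counter
from pathlib import Path

def count_ngrams(text, n):
    """Count every character n-gram of length n in one block of text.
    Each line is treated as a sentence, so no n-gram crosses a line break."""
    counts = Counter()
    for sentence in text.splitlines():
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts

def store_by_frequency(counts, out_dir, low_cutoff=10):
    """Frequency-segmented storage: strings occurring at most low_cutoff times
    are written unsorted, one file per frequency value; only the remaining
    high-frequency strings are sorted before being stored."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    low = {f: [] for f in range(1, low_cutoff + 1)}
    high = []
    for gram, freq in counts.items():
        (low[freq] if freq <= low_cutoff else high).append((gram, freq))
    for freq, grams in low.items():
        with open(out / f"freq_{freq:02d}.txt", "w", encoding="utf-8") as fh:
            fh.writelines(f"{g}\t{freq}\n" for g, _ in grams)
    high.sort(key=lambda item: item[1], reverse=True)  # sort high-frequency strings only
    with open(out / "freq_high_sorted.txt", "w", encoding="utf-8") as fh:
        fh.writelines(f"{g}\t{f}\n" for g, f in high)

Only the high-frequency strings are sorted in this sketch, which is what keeps the scheme cheap when most strings occur fewer than 10 times.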

Keywords: Chinese character, N-gram, Corpus, Sorting

Abstract: Counting the N-gram Chinese character strings of huge text corpora is a challenge for Chinese information processing, and Cici was developed to count huge Chinese text corpora efficiently. We found that the number of distinct Chinese strings is largest when the string length is 6, and that the number of strings can be estimated from the average sentence length. Since most Chinese strings appear no more than 10 times in a corpus, the N-gram strings are stored in 13 separate files according to their frequency, and only the frequently used strings are sorted. This strategy speeds up the counting process dramatically. Because physical memory is limited, a huge Chinese text corpus has to be divided into many blocks, whose suggested size is 20MB. Every block is counted separately, and the per-block results are then merged. The resulting algorithm counts huge corpora efficiently on a personal computer.
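Below is a minimal sketch of the block-wise counting and merging just described, assuming each block's counts fit in memory as a Python Counter and reusing the illustrative count_ngrams() helper from the earlier sketch; how Cici actually merges its on-disk block results is not specified in the abstract.

from collections import Counter

BLOCK_SIZE = 20 * 1024 * 1024  # roughly 20MB per block, as suggested in the abstract

def read_blocks(path, block_size=BLOCK_SIZE):
    """Yield the corpus in ~20MB chunks, cutting only at line ends so that
    no sentence (and hence no n-gram) is split across two blocks."""
    with open(path, encoding="utf-8") as fh:
        buf, size = [], 0
        for line in fh:
            buf.append(line)
            size += len(line.encode("utf-8"))
            if size >= block_size:
                yield "".join(buf)
                buf, size = [], 0
        if buf:
            yield "".join(buf)

def count_corpus(path, n=6):
    """Count each block separately, then merge the per-block results."""
    total = Counter()
    for block in read_blocks(path):
        total.update(count_ngrams(block, n))  # merge step: add block counts into the total
    return total

Because the blocks are cut at line boundaries, every sentence stays inside one block, so merging the block counters gives the same totals as counting the whole corpus at once while bounding peak memory use.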

Key words: Chinese character, N-gram, Corpora, Sorting

