N-gram Chinese Characters Counting for Huge Text Corpora

Abstract

Counting N-gram character strings in huge text corpora is a challenge for Chinese information processing. Cici was developed to count such strings in huge Chinese text corpora efficiently. We found that the number of distinct Chinese strings is largest when the string length is 6, and that the number of strings can be estimated from the average sentence length. Since most Chinese strings appear no more than 10 times in the corpora, the N-gram strings are stored in 13 separate files according to their frequency, and only the frequently used strings are sorted. This strategy speeds up the counting process dramatically. Because physical memory is limited, a huge Chinese text corpus has to be divided into many blocks, with a suggested block size of 20 MB. Every block is counted separately, and the per-block results are then merged. We implemented this algorithm so that huge corpora can be counted efficiently on a personal computer.

Key words: Chinese character, N-gram, Corpora, Sorting

YU Yi-Jiao and LIU Qin. N-gram Chinese Characters Counting for Huge Text Corpora [J]. Computer Science, 2014, 41(4): 263-268.
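
The block-wise strategy summarized in the abstract (count each block separately, merge the partial results, then sort only the frequent strings) can be illustrated with a minimal Python sketch. This is not the Cici implementation. The function names, the in-memory `Counter`, the overlap of `n - 1` characters between neighbouring blocks, and the per-frequency grouping of rare strings are all assumptions made for illustration; the abstract does not say how block boundaries are handled or how the 13 frequency files are defined.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_blocks(text: str, block_chars: int, n: int) -> Iterator[str]:
    """Yield consecutive blocks of the text.

    Assumption: each block is extended by n - 1 characters so that an
    n-gram straddling a block boundary is counted exactly once.
    """
    for start in range(0, len(text), block_chars):
        yield text[start:start + block_chars + n - 1]


def count_ngrams(block: str, n: int) -> Counter:
    """Count every length-n character string inside one block."""
    return Counter(block[i:i + n] for i in range(len(block) - n + 1))


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Merge per-block counts into one overall count."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def split_by_frequency(
    counts: Counter, threshold: int = 10
) -> Tuple[Dict[int, List[str]], List[Tuple[str, int]]]:
    """Group rare strings by frequency; sort only the frequent ones.

    Hypothetical scheme: strings with frequency <= threshold are grouped
    unsorted by their exact frequency, the rest are sorted by descending
    frequency (ties broken by the string itself).
    """
    rare: Dict[int, List[str]] = {}
    frequent: List[Tuple[str, int]] = []
    for gram, freq in counts.items():
        if freq <= threshold:
            rare.setdefault(freq, []).append(gram)
        else:
            frequent.append((gram, freq))
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return rare, frequent


if __name__ == "__main__":
    sample = "中文文本中文文本中文"
    partials = (count_ngrams(b, 2) for b in iter_blocks(sample, 3, 2))
    merged = merge_counts(partials)
    rare, frequent = split_by_frequency(merged, threshold=2)
    print(frequent)
    print(rare)
```

In a real run each group of rare strings would be written to its own file instead of being kept in a dictionary, and `block_chars` would be chosen so that one block's counts fit in physical memory (the abstract suggests blocks of 20 MB).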

