Computer Science ›› 2014, Vol. 41 ›› Issue (5): 270-274.doi: 10.11896/j.issn.1002-137X.2014.05.057

Previous Articles     Next Articles

Rapid Algorithm of Chinese High-frequency Repeat Extraction Based on Hierarchical Pruning

ZHANG Hai-jun,LIU Zhan-dong and Munina   

  • Online:2018-11-14 Published:2018-11-14

Abstract: To extract high-frequency repeats from large-scale corpus,by using the hash table structure,this paper put forward a hierarchical pruning algorithm based on low-frequency character filtration and Cascade Pruning to filtrate low-frequency strings and to reduce the times of I/O reading & writing.On this basis,this paper employed the improved string sort algorithm,which can implement string sort in O(n) time complexity,to improve the efficiency of repeat extraction.According to the experimental data,it can be concluded that this algorithm is an effective method of repeat extraction,in which the relationship between the times of I/O reading & writing and the scale of corpus is linear.The algorithm can effectively extract repeats from text corpus whose scale is much larger than that of the computer memory and can better suport the repeat-based applications such as new words identification,term extraction,etc.

Key words: Repeat,Hash table,Low-frequency strings,Hierarchical pruning,New words identification

[1] 黄昌宁,赵海.中文分词十年回顾[J].中文信息学报,2007,21(3):8-19
[2] 邹纲,刘洋,刘群,等.面向Internet的中文新词语检测[J].中文信息学报,2004,18(6):1-9
[3] 张海军,史树敏,朱朝勇,等.中文新词识别技术综述[J].计算机科学,2010,37(3):6-10
[4] 龚才春,贺敏,陈海强,等.大规模语料的频繁模式快速发现算法[J].通信学报,2007,28(12):161-166
[5] Nevill-Manning C G,Witten I H.Identifying Hierarchical Structure in Sequences:A Linear-Time Algorithm[J].Journal of Artificial Intelligence Research,1997,7(1):67-82
[6] Yamamoto M,Churcht K W.Using Suffix Arrays to ComputeTerm Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics,2001,27(1):1-30
[7] Larsson N J,Sadakane K.Faster Suffix Sorting[D].Lund,Sweden:Department of Computer Science,Lund University,1999
[8] Martin F-C,Paolo F,S M.On The Sorting Complexity of Suffix-Tree Construction[J].Journal of ACM,2000,47(6):987-1011
[9] Anisa A H,Maxime C,Lucian I,et al.A comparison of index-based lempel-Ziv LZ77factorization algorithms[J].ACM COMPUTING SerVey,2012,5(1):5
[10] 郑家恒,李文花.基于构词法的网络新词自动识别初探[J].山西大学学报:自然科学版,2002,25(2):115-119
[11] Clifford R,Sergot M.Distributed and Paged Suffix-Trees forLarge Genetic Databases[C]∥ Proceedings of 14th AnnualSymposium on Combinatorial Pattern Matching.2003:70-82
[12] Schurmann K-B,Stoye J.Suffix Tree Construction and Storage with Limited Main Memory[D].Bielefeld,Germany:University of Bielefeld,2003
[13] Tian Y,Tata S,Hankins R A,et al.Practical Methods for Constructing Suffix Trees[J].The VLDB Journal,2005,14(3):281-299
[14] Cormen T H,Leiserson C E,Rivest R L,et al.In Introduction to Algorithms (2nd Ed) [M].Cambridge,MA:MIT Press,2001
[15] 张海军,潘伟民,木妮娜,等.一种自定义顺序的字符串排序算法[J].小型微型计算机系统,2012,33(9):1968-1971

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!