计算机科学 ›› 2014, Vol. 41 ›› Issue (5): 270-274.doi: 10.11896/j.issn.1002-137X.2014.05.057

• 人工智能 • 上一篇    下一篇

基于逐层剪枝的中文高频重复模式快速提取算法

张海军,刘战东,木妮娜   

  1. 新疆师范大学初等教育学院 乌鲁木齐830054;新疆师范大学计算机科学技术学院 乌鲁木齐830054;新疆师范大学计算机科学技术学院 乌鲁木齐830054
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金项目(61163045,4),新疆维吾尔自治区高校科研基金(XJEDU2012S29),新疆师范大学重点学科招标课题(12XSXZ0601)资助

Rapid Algorithm of Chinese High-frequency Repeat Extraction Based on Hierarchical Pruning

ZHANG Hai-jun,LIU Zhan-dong and Munina   

  • Online:2018-11-14 Published:2018-11-14

摘要: 为了从大规模语料中快速提取高频重复模式,以递增n-gram模型为基础,使用散列数据结构提取重复串,并提出了一种基于低频字符和层次剪枝的逐层剪枝算法,用于过滤低频垃圾字串,减少I/O读写次数。在此基础上,应用改进的字串排序算法,使字符串排序可在O(n)时间内完成,从而有效提高重复模式的提取效率。实验表明,该算法是一种有效的重复模式提取算法,其I/O读写次数同语料规模呈线性关系,远小于使用首字符进行语料划分的方法,能快速有效地从规模远大于内存容量的文本语料中提取重复模式,特别适合于大规模语料的高频重复模式提取,对以重复模式为基础的新词识别、术语抽取等具有重要的支撑作用。

关键词: 重复串,散列表,低频字串,逐层剪枝,新词识别

Abstract: To extract high-frequency repeats from large-scale corpus,by using the hash table structure,this paper put forward a hierarchical pruning algorithm based on low-frequency character filtration and Cascade Pruning to filtrate low-frequency strings and to reduce the times of I/O reading & writing.On this basis,this paper employed the improved string sort algorithm,which can implement string sort in O(n) time complexity,to improve the efficiency of repeat extraction.According to the experimental data,it can be concluded that this algorithm is an effective method of repeat extraction,in which the relationship between the times of I/O reading & writing and the scale of corpus is linear.The algorithm can effectively extract repeats from text corpus whose scale is much larger than that of the computer memory and can better suport the repeat-based applications such as new words identification,term extraction,etc.

Key words: Repeat,Hash table,Low-frequency strings,Hierarchical pruning,New words identification

[1] 黄昌宁,赵海.中文分词十年回顾[J].中文信息学报,2007,21(3):8-19
[2] 邹纲,刘洋,刘群,等.面向Internet的中文新词语检测[J].中文信息学报,2004,18(6):1-9
[3] 张海军,史树敏,朱朝勇,等.中文新词识别技术综述[J].计算机科学,2010,37(3):6-10
[4] 龚才春,贺敏,陈海强,等.大规模语料的频繁模式快速发现算法[J].通信学报,2007,28(12):161-166
[5] Nevill-Manning C G,Witten I H.Identifying Hierarchical Structure in Sequences:A Linear-Time Algorithm[J].Journal of Artificial Intelligence Research,1997,7(1):67-82
[6] Yamamoto M,Churcht K W.Using Suffix Arrays to ComputeTerm Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics,2001,27(1):1-30
[7] Larsson N J,Sadakane K.Faster Suffix Sorting[D].Lund,Sweden:Department of Computer Science,Lund University,1999
[8] Martin F-C,Paolo F,S M.On The Sorting Complexity of Suffix-Tree Construction[J].Journal of ACM,2000,47(6):987-1011
[9] Anisa A H,Maxime C,Lucian I,et al.A comparison of index-based lempel-Ziv LZ77factorization algorithms[J].ACM COMPUTING SerVey,2012,5(1):5
[10] 郑家恒,李文花.基于构词法的网络新词自动识别初探[J].山西大学学报:自然科学版,2002,25(2):115-119
[11] Clifford R,Sergot M.Distributed and Paged Suffix-Trees forLarge Genetic Databases[C]∥ Proceedings of 14th AnnualSymposium on Combinatorial Pattern Matching.2003:70-82
[12] Schurmann K-B,Stoye J.Suffix Tree Construction and Storage with Limited Main Memory[D].Bielefeld,Germany:University of Bielefeld,2003
[13] Tian Y,Tata S,Hankins R A,et al.Practical Methods for Constructing Suffix Trees[J].The VLDB Journal,2005,14(3):281-299
[14] Cormen T H,Leiserson C E,Rivest R L,et al.In Introduction to Algorithms (2nd Ed) [M].Cambridge,MA:MIT Press,2001
[15] 张海军,潘伟民,木妮娜,等.一种自定义顺序的字符串排序算法[J].小型微型计算机系统,2012,33(9):1968-1971

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!