计算机科学 ›› 2014, Vol. 41 ›› Issue (5): 270-274.doi: 10.11896/j.issn.1002-137X.2014.05.057
张海军,刘战东,木妮娜
ZHANG Hai-jun,LIU Zhan-dong and Munina
摘要: 为了从大规模语料中快速提取高频重复模式,以递增n-gram模型为基础,使用散列数据结构提取重复串,并提出了一种基于低频字符和层次剪枝的逐层剪枝算法,用于过滤低频垃圾字串,减少I/O读写次数。在此基础上,应用改进的字串排序算法,使字符串排序可在O(n)时间内完成,从而有效提高重复模式的提取效率。实验表明,该算法是一种有效的重复模式提取算法,其I/O读写次数同语料规模呈线性关系,远小于使用首字符进行语料划分的方法,能快速有效地从规模远大于内存容量的文本语料中提取重复模式,特别适合于大规模语料的高频重复模式提取,对以重复模式为基础的新词识别、术语抽取等具有重要的支撑作用。
[1] 黄昌宁,赵海.中文分词十年回顾[J].中文信息学报,2007,21(3):8-19 [2] 邹纲,刘洋,刘群,等.面向Internet的中文新词语检测[J].中文信息学报,2004,18(6):1-9 [3] 张海军,史树敏,朱朝勇,等.中文新词识别技术综述[J].计算机科学,2010,37(3):6-10 [4] 龚才春,贺敏,陈海强,等.大规模语料的频繁模式快速发现算法[J].通信学报,2007,28(12):161-166 [5] Nevill-Manning C G,Witten I H.Identifying Hierarchical Structure in Sequences:A Linear-Time Algorithm[J].Journal of Artificial Intelligence Research,1997,7(1):67-82 [6] Yamamoto M,Churcht K W.Using Suffix Arrays to ComputeTerm Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics,2001,27(1):1-30 [7] Larsson N J,Sadakane K.Faster Suffix Sorting[D].Lund,Sweden:Department of Computer Science,Lund University,1999 [8] Martin F-C,Paolo F,S M.On The Sorting Complexity of Suffix-Tree Construction[J].Journal of ACM,2000,47(6):987-1011 [9] Anisa A H,Maxime C,Lucian I,et al.A comparison of index-based lempel-Ziv LZ77factorization algorithms[J].ACM COMPUTING SerVey,2012,5(1):5 [10] 郑家恒,李文花.基于构词法的网络新词自动识别初探[J].山西大学学报:自然科学版,2002,25(2):115-119 [11] Clifford R,Sergot M.Distributed and Paged Suffix-Trees forLarge Genetic Databases[C]∥ Proceedings of 14th AnnualSymposium on Combinatorial Pattern Matching.2003:70-82 [12] Schurmann K-B,Stoye J.Suffix Tree Construction and Storage with Limited Main Memory[D].Bielefeld,Germany:University of Bielefeld,2003 [13] Tian Y,Tata S,Hankins R A,et al.Practical Methods for Constructing Suffix Trees[J].The VLDB Journal,2005,14(3):281-299 [14] Cormen T H,Leiserson C E,Rivest R L,et al.In Introduction to Algorithms (2nd Ed) [M].Cambridge,MA:MIT Press,2001 [15] 张海军,潘伟民,木妮娜,等.一种自定义顺序的字符串排序算法[J].小型微型计算机系统,2012,33(9):1968-1971 |
No related articles found! |
|