基于大规模语料划分的频繁模式查找算法

计算机科学 ›› 2012, Vol. 39 ›› Issue (3): 153-156.

基于大规模语料划分的频繁模式查找算法

丁溪源,黄河燕,张海军,王树梅

(南京理工大学计算机科学与技术学院南京210094);(北京理工大学计算机科学技术学院北京100081);(中国科学院计算机语言信息工程研究中心北京100097)

出版日期:2018-11-16 发布日期:2018-11-16

Algorithm of Frequent Patterns Finding Based on Large Scale Corpus Partition

DING Xi-yuan,HUANG He-yan,ZHANG Hai-jun,WANG Shu-mei

Online:2018-11-16 Published:2018-11-16

摘要/Abstract

摘要： 频繁模式查找对新词识别、网络奥情监测、生物信息序列检测等领域有很高的应用价值。为处理规模远超出内存的语料，提出了一种实用的频繁模式查找算法。先将语料按后缀首字符划分为多个集合，通过逐条扫描集合数据，搜索出最大化最长公共前缀区间(MI_CPI)来完成查找。另外在此基拙上提出逐层归并算法，实现查找的同时归并子串。由于进行查找时无需将全部数据导入内存，因此资源消耗较少;各集合间频繁模式查找互不千扰，可采用并行处理加快运行速度。使用4. 61C}纯文本语料进行了试验，结果表明其内存消耗小于30M，查找速度最快达1. 08M/s，能高效地进行子串归并。

关键词: 频繁模式，重复串，语料划分，子串归并

Abstract: Frequent patterns finding is useful for some areas, such as new word recognition, Internet public opinion monitoring, bio-information series detection, etc. Considering that corpus size is much larger than memory capacity, we put forward a pragmatic algorithm to find frectuent patterns. Firstly, corpus was partitioned into multiple sets based on first character of suffix,and then a concept of maximized longest common prefix interval (MLCPI) was introduced,and by means of searching it while scanning data in sets item by item, we accomplished the finding task. Besides, we proposed hierarchical reduction algorithm (HRA) to reduce sulrstring during the finding process on that basis. There is no need to import all data into memory, so it will decrease resource consumption. Moreover, it is found that frequent patterns among sets do not interfere with each other, which will improve the speed while processing paralleled. We used 4. 61 gigabytes plain text as experiment data. The results show that the memory usage is lower than 30 megabytes, and the speed is up to 1. 08 megabytes per seconds, and it is able to reduce sub-string efficiently.

Key words: Frecauent pattern, Repeats, Corpus partition, Sulrstring reduction

丁溪源,黄河燕,张海军,王树梅. 基于大规模语料划分的频繁模式查找算法[J]. 计算机科学, 2012, 39(3): 153-156. https://doi.org/

DING Xi-yuan,HUANG He-yan,ZHANG Hai-jun,WANG Shu-mei. Algorithm of Frequent Patterns Finding Based on Large Scale Corpus Partition[J]. Computer Science, 2012, 39(3): 153-156. https://doi.org/

参考文献

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed