Computer Science ›› 2012, Vol. 39 ›› Issue (3): 153-156.


Algorithm of Frequent Patterns Finding Based on Large Scale Corpus Partition

DING Xi-yuan, HUANG He-yan, ZHANG Hai-jun, WANG Shu-mei

  • Online:2018-11-16 Published:2018-11-16

Abstract: Frequent pattern finding is useful in areas such as new word recognition, Internet public opinion monitoring, and biological sequence detection. Because corpus size is often much larger than memory capacity, we put forward a pragmatic algorithm for finding frequent patterns. First, the corpus is partitioned into multiple sets according to the first character of each suffix. We then introduce the concept of the maximized longest common prefix interval (MLCPI) and accomplish the finding task by searching for such intervals while scanning the data in each set item by item. On that basis, we propose a hierarchical reduction algorithm (HRA) to reduce substrings during the finding process. The data never has to be loaded into memory all at once, which decreases resource consumption. Moreover, frequent patterns in different sets do not interfere with each other, so the sets can be processed in parallel to improve speed. We used 4.61 gigabytes of plain text as experimental data. The results show that memory usage stays below 30 megabytes, that the speed reaches 1.08 megabytes per second, and that the algorithm reduces substrings efficiently.

Key words: Frequent pattern, Repeats, Corpus partition, Substring reduction
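
The pipeline outlined in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: the abstract does not define MLCPI or HRA in detail, so the sketch substitutes plain runs of adjacent sorted suffixes whose longest common prefix is at least k characters, and a simple post-hoc filter that drops a pattern when a longer pattern with the same count contains it. All function names and the parameters `min_freq` and `max_len` are hypothetical, and a real system would stream each partition from disk rather than hold the text in memory.

```python
from collections import defaultdict


def partition_suffixes(text):
    """Group suffix start positions by the first character of the suffix."""
    buckets = defaultdict(list)
    for i, ch in enumerate(text):
        buckets[ch].append(i)
    return buckets


def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def frequent_in_bucket(text, positions, min_freq, max_len):
    """Find repeated prefixes among the sorted suffixes of one bucket."""
    # Truncate suffixes to max_len characters (assumed cap on pattern length).
    sufs = sorted(text[i:i + max_len] for i in positions)
    lcps = [lcp_len(sufs[j - 1], sufs[j]) for j in range(1, len(sufs))]
    found = {}
    for k in range(1, max_len + 1):
        start = 0
        for j in range(1, len(sufs) + 1):
            # A run ends where adjacent suffixes share fewer than k characters.
            if j == len(sufs) or lcps[j - 1] < k:
                size = j - start
                if size >= min_freq and len(sufs[start]) >= k:
                    found[sufs[start][:k]] = size
                start = j
    return found


def find_frequent_patterns(text, min_freq=2, max_len=8):
    """Process each bucket independently and merge the results."""
    found = {}
    for positions in partition_suffixes(text).values():
        # Buckets share no patterns, so this loop could run in parallel.
        found.update(frequent_in_bucket(text, positions, min_freq, max_len))
    return found


def reduce_substrings(patterns):
    """Drop a pattern if a longer pattern with the same count contains it."""
    return {
        p: n for p, n in patterns.items()
        if not any(len(q) > len(p) and m == n and p in q
                   for q, m in patterns.items())
    }
```

On the toy string "abcabcabx" with `min_freq=2` and `max_len=3`, `find_frequent_patterns` reports nine patterns, and `reduce_substrings` keeps only "ab" (3 occurrences) and "abc", "bca", "cab" (2 occurrences each). Because every pattern starts with the first character of its bucket, no pattern can appear in two buckets, which is what makes the per-bucket loop safe to parallelize.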
