Computer Science ›› 2011, Vol. 38 ›› Issue (8): 165-168.

Indent Shape Based Approach for Mining Repeated Patterns of HTML Documents

ZHU Yan-xu, WANG Huai-min, SHI Dian-x, YIN Gang, YUAN Lin, LI Xiang

  • Online: 2018-11-16 Published: 2018-11-16

Abstract: Mining repeated patterns is the key to finding the encoding templates of Web pages, which is the basis for automatic Web data extraction and Web content mining. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their performance remains a challenge when processing massive numbers of Web pages. To improve performance, this paper presents a novel approach, based on indent shape, for mining repeated patterns in HTML documents. First, the approach defines the indent shape model, a simplified abstraction of an HTML document that consists of the indent and the first tag of each line. Then, it detects repeated patterns indirectly by identifying tandem repeated waves in the indent shape. Extensive experiments show that the approach achieves better performance than existing approaches.

Key words: Mining repeated patterns, Web data extraction, Web content mining, Indent shape, Tandem repeated waves
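
The two steps named in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm: the abstract does not specify how indents are measured, how the first tag is chosen, or how tandem repeated waves are detected. The sketch assumes that the indent is the count of leading whitespace columns (tabs expanded to 4), that a closing tag counts as a line's first tag, and that a naive scan for adjacent identical blocks stands in for wave detection. The names `indent_shape` and `tandem_repeats` are hypothetical.

```python
import re
from typing import List, Tuple

# One line of the simplified model: (indent width, first tag on the line).
ShapeLine = Tuple[int, str]

# Matches the first opening or closing tag name on a line (assumption:
# closing tags such as "/li" are kept as tags in their own right).
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")


def indent_shape(html: str) -> List[ShapeLine]:
    """Reduce an HTML document to (indent, first tag) pairs, one per line."""
    shape: List[ShapeLine] = []
    for raw in html.splitlines():
        if not raw.strip():
            continue  # blank lines carry no shape information
        line = raw.expandtabs(4)
        indent = len(line) - len(line.lstrip())
        match = TAG_RE.search(line)
        tag = match.group(1).lower() if match else "#text"
        shape.append((indent, tag))
    return shape


def tandem_repeats(shape: List[ShapeLine],
                   min_copies: int = 2) -> List[Tuple[int, int, int]]:
    """Naive scan for adjacent identical blocks in the shape sequence.

    Returns (start, period, copies) triples. At each start position the
    block covering the most lines wins, then the scan skips past it.
    """
    results: List[Tuple[int, int, int]] = []
    n = len(shape)
    i = 0
    while i < n:
        best = None
        for period in range(1, (n - i) // min_copies + 1):
            block = shape[i:i + period]
            copies = 1
            # A short slice at the end of the list never equals the block,
            # so this loop stops at the sequence boundary.
            while shape[i + copies * period:i + (copies + 1) * period] == block:
                copies += 1
            if copies >= min_copies and (
                best is None or copies * period > best[1] * best[2]
            ):
                best = (i, period, copies)
        if best is not None:
            results.append(best)
            i += best[1] * best[2]
        else:
            i += 1
    return results


if __name__ == "__main__":
    sample = (
        "<ul>\n"
        "  <li>\n    <a>one</a>\n  </li>\n"
        "  <li>\n    <a>two</a>\n  </li>\n"
        "  <li>\n    <a>three</a>\n  </li>\n"
        "</ul>"
    )
    print(tandem_repeats(indent_shape(sample)))
```

On the three-item list in the sample, the scan reports one repeat starting at shape line 1 with a period of 3 lines and 3 copies, which corresponds to the three `<li>` records. Comparing short (indent, tag) pairs instead of full DOM subtrees or token strings is what makes this kind of abstraction cheap, at the cost of relying on the page being consistently indented.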
