Computer Science, 2011, Vol. 38, Issue (8): 165-168.
ZHU Yan-xu, WANG Huai-min, SHI Dian-x, YIN Gang, YUAN Lin, LI Xiang
Abstract: Mining repeated patterns is the key to finding the encoding templates of Web pages, which are the basis for automatic Web data extraction and Web content mining. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their performance remains a challenge when processing massive numbers of Web pages. To improve performance, this paper presents a novel indent-shape-based approach for mining repeated patterns in HTML documents. First, the approach defines the indent shape model, a simplified abstraction of an HTML document consisting of the indent and the first tag of each line. Then, it detects repeated patterns indirectly by identifying tandem repeated waves in the indent shape. Extensive experiments show that the approach achieves better performance than existing approaches.
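The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not define a "wave" precisely, so the sketch assumes a wave is a contiguous block of (indent, first tag) points and that a tandem repeat is the same block occurring back to back. The names `indent_shape` and `tandem_repeats` and the parameters `min_period` and `min_count` are hypothetical, and the input is assumed to be pretty-printed HTML with one tag-leading element per line.

```python
import re
from typing import List, Tuple

# One point of the indent shape: (indent width, first tag on the line).
Point = Tuple[int, str]

# Matches the first tag name at the start of a stripped line, e.g. "li" or "/li".
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")


def indent_shape(html: str) -> List[Point]:
    """Reduce an HTML document to (indent, first tag) points, one per line.

    Lines that do not start with a tag are skipped (an assumption of this sketch).
    """
    shape: List[Point] = []
    for line in html.splitlines():
        stripped = line.lstrip(" \t")
        match = TAG_RE.match(stripped)
        if match is None:
            continue
        indent = len(line) - len(stripped)
        shape.append((indent, match.group(1).lower()))
    return shape


def tandem_repeats(shape: List[Point], min_period: int = 2,
                   min_count: int = 2) -> List[Tuple[int, int, int]]:
    """Find back-to-back repeated blocks in the shape with a brute-force scan.

    Returns (start, period, count) triples: a block of `period` points starting
    at `start` occurs `count` times consecutively.
    """
    results: List[Tuple[int, int, int]] = []
    n = len(shape)
    i = 0
    while i < n:
        best = None
        for period in range(min_period, (n - i) // min_count + 1):
            unit = shape[i:i + period]
            count = 1
            # Extend the run while the next block equals the first block.
            while shape[i + count * period:i + (count + 1) * period] == unit:
                count += 1
            # Prefer the run covering the most points; ties keep the shorter period.
            if count >= min_count and (best is None or count * period > best[1] * best[2]):
                best = (i, period, count)
        if best is not None:
            results.append(best)
            i += best[1] * best[2]  # skip past the detected run
        else:
            i += 1
    return results
```

For a three-item list such as `<ul>` containing three indented `<li>` blocks, the shape rises and falls three times, and the scan reports one run of three consecutive blocks. The brute-force scan is chosen for readability only; it says nothing about the performance characteristics the paper reports.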
Key words: Mining repeated patterns, Web data extraction, Web content mining, Indent shape, Tandem repeated waves
ZHU Yan-xu, WANG Huai-min, SHI Dian-x, YIN Gang, YUAN Lin, LI Xiang. Indent Shape Based Approach for Mining Repeated Patterns of HTML Documents[J]. Computer Science, 2011, 38(8): 165-168.
URL: https://www.jsjkx.com/EN/Y2011/V38/I8/165