Computer Science ›› 2011, Vol. 38 ›› Issue (8): 165-168.

• Databases and Data Mining •

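Indent Shape Based Approach for Mining Repeated Patterns of HTML Documents

ZHU Yan-xu, WANG Huai-min, SHI Dian-xi, YIN Gang, YUAN Lin, LI Xiang

  1. (College of Computer, National University of Defense Technology, Changsha 410073); (Institute of Electronic Technology, Information Engineering University, Zhengzhou 450004)
  • Online: 2018-11-16 Published: 2018-11-16
  • Supported by:
    This work was supported by the National 863 Program Key Project (2007AA010301), the National Natural Science Foundation of China (60903043), and the Major Special Project on Core Electronic Devices, High-end Generic Chips and Basic Software (2009ZX01043-001).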

Abstract: Mining repeated patterns in HTML documents is the key to finding the encoding templates of Web pages, and it underpins automatic Web data extraction and Web content mining. Traditional approaches based on string matching and tree matching detect repeated patterns with high precision, but their performance remains a challenge when processing massive numbers of Web pages. To improve performance, this paper presents an indent shape based approach for mining repeated patterns of HTML documents. The approach first defines the indent shape model, a data structure made up of the indent value and the leading HTML tag of each line of code, which serves as a simplified abstraction of the HTML document. It then mines repeated patterns indirectly by detecting tandem repeated waves in the indent shape. Experiments show that the approach not only achieves high precision but also noticeably improves performance.

Key words: Mining repeated patterns, Web data extraction, Web content mining, Indent shape, Tandem repeated waves
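
The abstract describes the indent shape only in outline, so the following Python sketch is an illustration under stated assumptions rather than the paper's algorithm. It assumes the indent shape can be approximated as one (indent width, first tag) pair per non-blank line, and it treats a "tandem repeated wave" as a block of pairs that repeats back to back. The names `indent_shape`, `tandem_repeats`, `SAMPLE`, and the `tab_width` and `min_copies` parameters are hypothetical, and the repeat search is a naive scan chosen for readability.

```python
import re
from typing import List, Optional, Tuple

# One entry per non-blank line: (indent width, first tag at line start or None).
Shape = List[Tuple[int, Optional[str]]]

# Matches an opening or closing tag name at the start of a stripped line.
TAG_AT_LINE_START = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")

# Hypothetical, pretty-printed input used only for illustration.
SAMPLE = "\n".join([
    "<ul>",
    "  <li>",
    '    <a href="/a">A</a>',
    "  </li>",
    "  <li>",
    '    <a href="/b">B</a>',
    "  </li>",
    "  <li>",
    '    <a href="/c">C</a>',
    "  </li>",
    "</ul>",
])


def indent_shape(html: str, tab_width: int = 4) -> Shape:
    """Reduce an HTML document to (indent, first tag) pairs, skipping blank lines."""
    shape: Shape = []
    for line in html.splitlines():
        expanded = line.expandtabs(tab_width)
        stripped = expanded.lstrip(" ")
        if not stripped:
            continue
        match = TAG_AT_LINE_START.match(stripped)
        tag = match.group(1).lower() if match else None
        shape.append((len(expanded) - len(stripped), tag))
    return shape


def tandem_repeats(shape: Shape, min_copies: int = 2) -> List[Tuple[int, int, int]]:
    """Return (start, period, copies) for back-to-back repeated blocks.

    Naive scan (assumption: exact equality of pairs defines a repeat).
    At each start position the candidate covering the most lines wins.
    """
    found: List[Tuple[int, int, int]] = []
    n = len(shape)
    start = 0
    while start < n:
        best = None
        for period in range(1, (n - start) // min_copies + 1):
            unit = shape[start:start + period]
            copies = 1
            # A slice that runs past the end is shorter than `unit`, so the loop stops.
            while shape[start + copies * period:start + (copies + 1) * period] == unit:
                copies += 1
            if copies >= min_copies and (best is None or copies * period > best[1] * best[2]):
                best = (start, period, copies)
        if best:
            found.append(best)
            start += best[1] * best[2]  # skip past the whole repeated region
        else:
            start += 1
    return found
```

Calling `tandem_repeats(indent_shape(SAMPLE))` on the three-item list reports a single repeat that starts at shape index 1 with a period of 3 lines and 3 copies, which corresponds to the three `<li>` blocks. Working on pairs of small values instead of full strings or DOM trees is what makes this kind of abstraction cheap to compare, though it depends on the source being consistently indented.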
