Automatic Web Structured Data Extraction Based on Tag Path

Computer Science ›› 2013, Vol. 40 ›› Issue (Z6): 141-144.

LI Gui (李贵), CHEN Cheng (陈成), LI Zheng-yu (李征宇), HAN Zi-yang (韩子扬), SUN Ping (孙平), SUN Huan-liang (孙焕良)

All authors: Department of Information and Control Engineering, Shenyang Jianzhu University, Shenyang 110168, China

Online: 2018-11-16 Published: 2018-11-16

Funding:
This work was supported by the National Natural Science Foundation of China (61070024).

Abstract

Abstract: This paper proposes a tag-path-based method for automatically extracting structured data from Web pages. The method parses the DOM tree of a page to obtain its complete set of tag paths, then clusters the tag paths using a path similarity measure to locate the target data region. It then uses the coordinate positions of the tag nodes to separate the individual data items and filter out redundant data, which completes the extraction. Experimental results show that, compared with the MDR method, the proposed method achieves higher recall and precision on Web pages that contain structured data.

Key words: Tag path, Structured data extraction, Clustering
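The first two steps in the abstract (collecting tag paths from the parsed page, then clustering them to locate the data region) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not define the similarity measure, so the shared-prefix ratio used here is an assumed stand-in, and the greedy clustering, the 0.8 threshold, and all names (`TagPathCollector`, `path_similarity`, `cluster_paths`, `largest_cluster`) are hypothetical. It uses Python's standard `html.parser` instead of a full DOM library and omits the coordinate-based separation of data items, which would need a rendering engine.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def path_similarity(p, q):
    """Assumed measure: shared-prefix length over the longer path length."""
    a, b = p.split("/"), q.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


def cluster_paths(pairs, threshold=0.8):
    """Greedy single pass: join the first cluster whose representative
    path is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (representative_path, [texts])
    for path, text in pairs:
        for rep, texts in clusters:
            if path_similarity(rep, path) >= threshold:
                texts.append(text)
                break
        else:
            clusters.append((path, [text]))
    return clusters


def largest_cluster(html):
    """Treat the most populated cluster as the candidate data region."""
    collector = TagPathCollector()
    collector.feed(html)
    return max(cluster_paths(collector.paths), key=lambda c: len(c[1]))


# Hypothetical sample page: three list items share one tag path.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li><a href="/a">Title A</a></li>
    <li><a href="/b">Title B</a></li>
    <li><a href="/c">Title C</a></li>
  </ul>
  <p>Footer</p>
</body></html>
"""

rep_path, texts = largest_cluster(SAMPLE_HTML)
print(rep_path, texts)
```

On the sample page, the three link texts share the path `html/body/ul/li/a` and form the largest cluster, while the heading and footer fall into singleton clusters. A real page would likely need a more tolerant similarity measure and a rule for merging several related clusters into one data region.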

LI Gui, CHEN Cheng, LI Zheng-yu, HAN Zi-yang, SUN Ping, SUN Huan-liang. Automatic Web Structured Data Extraction Based on Tag Path[J]. Computer Science, 2013, 40(Z6): 141-144. https://doi.org/

References

[1] Sun Ji-gui, Liu Jie, Zhao Lian-yu. Clustering algorithms research[J]. Journal of Computer Research and Development, 2008(19): 48-61
[2] Liu Bing. Web Data Mining[M]. Yu Yong, Xue Gui-rong, Han Ding-yi, trans. Beijing: Tsinghua University Press, 2009: 291-295
[3] Liu Bing, Grossman R, Zhai Yan-hong. Mining data records in Web pages[C]//Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining. 2003: 601-606
[4] Jsoup: Java HTML Parser. http://jsoup.org/apidocs/
[5] Li Xiao-dong, Gu Yu-qing. DOM-based Web information extraction[J]. Chinese Journal of Computers, 2002, 25
[6] Miao G, Tatemura J, Hsiung Wang-pin, et al. Extracting data records from the Web using tag path clustering[C]//Madrid. 2009
[7] Arasu A, Garcia-Molina H. Extracting structured data from Web pages[C]//Proc of ACM SIGMOD International Conference on the Management of Data. 2003: 337-348
[8] Cafarella M J, Halevy A, Wang D Z, et al. Exploring the power of tables on the Web[C]//Proceedings of the 34th International Conference on Very Large Data Bases. 2008: 538-549
