Computer Science ›› 2013, Vol. 40 ›› Issue (Z6): 141-144.


Automatic Web Structured Data Extraction Based on Tag Path

LI Gui, CHEN Cheng, LI Zheng-yu, HAN Zi-yang, SUN Ping and SUN Huan-liang

  • Online:2018-11-16 Published:2018-11-16

Abstract: This paper presents a tag-path-based method for extracting structured data from Web pages. The method parses the DOM tree of a Web document to obtain the complete set of tag paths. The tag paths are then clustered using a similarity measure introduced in the paper, which locates the data region. Tag position features are then used to separate and filter the records, which completes the extraction. Experiments show that the method achieves higher precision and recall than MDR on pages that contain structured data.

Key words: Tag path, Extracting structured data, Clustering
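
The pipeline outlined in the abstract (collect tag paths, group them, locate the data region, split records) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the similarity measure or the tag position features, so the sketch assumes the simplest stand-ins: tag paths are grouped by exact equality rather than clustered by similarity, and a new record starts whenever the first repeated path reappears. All function names, the `min_repeats` threshold, and the sample page are hypothetical. It uses Python's standard `html.parser` module.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

SAMPLE = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li><b>Title A</b><span>10</span></li>
    <li><b>Title B</b><span>12</span></li>
    <li><b>Title C</b><span>15</span></li>
  </ul>
</body></html>
"""


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.paths = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def collect_tag_paths(html):
    """Return (tag_path, text) pairs for the whole document."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def group_by_path(paths):
    """Assumed stand-in for clustering: group text nodes with identical paths."""
    groups = defaultdict(list)
    for position, (path, text) in enumerate(paths):
        groups[path].append((position, text))
    return groups


def candidate_data_paths(groups, min_repeats=2):
    """Paths that repeat are treated as belonging to the data region."""
    return [path for path, hits in groups.items() if len(hits) >= min_repeats]


def split_records(paths, data_paths):
    """Assumed heuristic: a record starts where the first repeated path recurs."""
    if not data_paths:
        return []
    wanted, leader = set(data_paths), data_paths[0]
    records = []
    for path, text in paths:
        if path not in wanted:
            continue
        if path == leader or not records:
            records.append([])
        records[-1].append(text)
    return records


if __name__ == "__main__":
    paths = collect_tag_paths(SAMPLE)
    data_paths = candidate_data_paths(group_by_path(paths))
    for record in split_records(paths, data_paths):
        print(record)
```

On the sample page, the heading path occurs once and is discarded, while the two paths under `li` repeat and yield one record per list item. A real page would need a tolerant similarity measure in place of exact path equality, since records in the same region often differ slightly in their markup.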

