基于标签路径的Web结构化数据自动抽取

Computer Science, 2013, Vol. 40, Issue (Z6): 141-144.

Automatic Web Structured Data Extraction Based on Tag Path

LI Gui, CHEN Cheng, LI Zheng-yu, HAN Zi-yang, SUN Ping and SUN Huan-liang

Online:2018-11-16 Published:2018-11-16

Abstract

Abstract: This paper presents a tag-path-based method for extracting structured data from Web pages. The method first parses the DOM tree of a Web document to obtain the complete set of tag paths. It then clusters the tag paths using an introduced similarity measure to locate the data region, uses tag position features to separate and filter records, and finally completes the data extraction. Experiments show that the method achieves higher precision and recall than MDR on pages that contain structured data.

Key words: Tag path, Extracting structured data, Clustering
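
The first step named in the abstract, collecting tag paths from the DOM tree, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `TagPathCollector`, the helpers `collect_tag_paths` and `positions_by_path`, and the sample page are all hypothetical, and the standard-library `html.parser` stands in for whatever parser the paper used. Because the abstract does not define the similarity measure, the sketch stops at recording the positions at which each distinct tag path occurs, which is the kind of input a position-based clustering step could work from.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Hypothetical helper: records the root-to-node tag path of every element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open ancestor tags
        self.paths = []   # one tag path per element, in document order

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def collect_tag_paths(html):
    """Returns the tag path of every element in document order."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def positions_by_path(paths):
    """Maps each distinct tag path to the positions at which it occurs."""
    occurrences = defaultdict(list)
    for position, path in enumerate(paths):
        occurrences[path].append(position)
    return dict(occurrences)


# Assumed sample page: two list items with the same internal structure.
sample = ("<html><body><ul>"
          "<li><b>A</b><i>1</i></li>"
          "<li><b>B</b><i>2</i></li>"
          "</ul></body></html>")
paths = collect_tag_paths(sample)
occurrences = positions_by_path(paths)
```

In this sample the paths `html/body/ul/li`, `html/body/ul/li/b` and `html/body/ul/li/i` each occur twice at regularly spaced positions, which is the sort of repetition that marks a candidate data region.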

LI Gui, CHEN Cheng, LI Zheng-yu, HAN Zi-yang, SUN Ping and SUN Huan-liang. Automatic Web Structured Data Extraction Based on Tag Path[J]. Computer Science, 2013, 40(Z6): 141-144.
