Web Structured Data Extraction Based on Page Tags

Computer Science (计算机科学), 2007, Vol. 34, Issue (10): 133-136.

• Software Engineering and Database Technology •

REN Zhong-Sheng (任仲晟), XUE Yong-Sheng (薛永生)

Department of Computer Science, Xiamen University, Xiamen 361005

Online: 2018-11-16 Published: 2018-11-16
Supported by:
National Natural Science Foundation of China (50474033), Natural Science Foundation of Fujian Province (A0310008), Key Science and Technology Project of Fujian Province (2003H043).

Abstract

Abstract: This paper studies the problem of extracting structured data from data-intensive Web pages and proposes a data extraction algorithm based on page tags. The algorithm first determines the nesting relationships between tag elements from their display positions and sizes, and constructs a simplified HTML tree, Sim-HTree, which effectively reduces the time needed to identify data records. On this basis, a substring matching and adjustment algorithm is proposed to identify data records and label data items. Experiments show that the algorithm is effective.

Key words: Web data extraction, Web mining, Structured data, Information extraction
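
The abstract does not give the details of how Sim-HTree is built, so the following is only a minimal sketch of the general idea it describes: inferring which tag is nested in which from rendered position and size alone. The `Box` class, the `build_tree` function, and the input format (one rectangle per tag) are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Box:
    """Rendered rectangle of one tag (hypothetical input format)."""
    tag: str
    left: float
    top: float
    width: float
    height: float
    children: List["Box"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this rectangle."""
        return (
            self.left <= other.left
            and self.top <= other.top
            and other.left + other.width <= self.left + self.width
            and other.top + other.height <= self.top + self.height
        )


def build_tree(boxes: List[Box]) -> List[Box]:
    """Attach each box to the smallest box that contains it; return the roots."""
    # Visit larger boxes first so a container is placed before its contents.
    ordered = sorted(boxes, key=lambda b: b.area, reverse=True)
    roots: List[Box] = []
    placed: List[Box] = []
    for box in ordered:
        parent: Optional[Box] = None
        for candidate in placed:
            # Keep the tightest enclosing rectangle seen so far.
            if candidate.contains(box) and (
                parent is None or candidate.area < parent.area
            ):
                parent = candidate
        (parent.children if parent else roots).append(box)
        placed.append(box)
    return roots
```

Under these assumptions, a table rectangle that encloses two row rectangles becomes their parent, and a cell rectangle is attached to the row that encloses it rather than to the table, because the row is the smallest enclosing rectangle.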

REN Zhong-Sheng, XUE Yong-Sheng. Web Structured Data Extraction Based on Page Tags[J]. Computer Science, 2007, 34(10): 133-136.
