xScraper: Bulk- and Deep-extracting Non-structured Web Information Based on Web-Harvest Techniques

Computer Science, 2012, Vol. 39, Issue 12: 149-152.


ZHU Yan, ZHU Kai

(School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031)

Online: 2018-11-16 Published: 2018-11-16

xScraper:Bulk- and Deep-extracting Non-structured Web Information Based on Web-Harvest Techniques


Abstract

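Abstract: By analyzing the design principles of the data extraction rules in Web-Harvest, we designed and implemented a system named xScraper. Its main functions are: (1) custom design of Web data extraction rule templates that meet different needs and drive the Web-Harvest core to extract unstructured information; (2) controllable bulk extraction of Web information (including images) from the same website; (3) deep extraction of topic-related information across websites; (4) extraction of Web information metadata and its conversion into XML tags; (5) database management of unstructured multimedia information. Application results show that the system provides value-added functionality beyond Web-Harvest, meets different information extraction needs, and is simple, practical, and easy to extend.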

Keywords: Web information extraction, xScraper system, Web-Harvest core technology

Abstract: A system named xScraper was developed based on an investigation of the data extraction rules in Web-Harvest. Its five main functions are: (1) flexible specification of extraction rules to meet different application requirements; (2) controllable bulk extraction of non-structured data (incl. images) from the same Web site; (3) deep extraction of topic-related information across many Web sites; (4) extraction of metadata from Web sites and transformation into XML tags; (5) non-structured multimedia information management in databases. xScraper is a simple, practical and extendable system. It provides value-added services over Web-Harvest and can meet different requirements of Web information extraction.

Key words: Web information extraction, xScraper, Web-Harvest core techniques

ZHU Yan, ZHU Kai. xScraper: Bulk- and Deep-extracting Non-structured Web Information Based on Web-Harvest Techniques[J]. Computer Science, 2012, 39(12): 149-152. https://doi.org/
