A Web Page Information Extraction Method Based on a Classification Algorithm

Computer Science (计算机科学) ›› 2008, Vol. 35 ›› Issue (3): 91-93.


Published: 2018-11-16 Online: 2018-11-16
Funding: National 242 Fund (project nos. 20051322, 2006820).


Abstract

Abstract: Many current Web information extraction techniques rely on HTML structure. Because the structure of HTML pages changes frequently, the extraction templates (wrappers) must be updated often, and each update requires considerable domain knowledge. This paper proposes a Web information extraction method based on a classification algorithm: the text of a web page is grouped by its display attributes, the text is then classified on the basis of its display attribute values, and the text of interest is selected, which completes the extraction. The method is simple to operate, easy to implement, and depends little on page structure.

Key words: Web information extraction, Attribute vector, Wrapper, Display attributes
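The idea in the abstract (group page text by display attributes, then classify each group by its attribute values) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration, not the paper's implementation: the attribute set (`tag`, `size`, `color`, `bold`), the default values, the 1-nearest-neighbor classifier with a mismatch-count distance, and the names `DisplayTextParser`, `group_by_display`, `classify`, and `extract`. The abstract does not say which attributes or which classification algorithm the authors used.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed display attributes; the abstract does not list the paper's actual ones.
DISPLAY_ATTRS = ("tag", "size", "color", "bold")
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag
BOLD_TAGS = {"b", "strong", "h1", "h2", "h3"}


class DisplayTextParser(HTMLParser):
    """Collects (attribute_vector, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        # Index 0 holds assumed page defaults and is never popped.
        self.stack = [{"tag": "root", "size": "3", "color": "black", "bold": False}]
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(self.stack[-1])  # inherit display attributes from the parent
        style["tag"] = tag
        if tag == "font":
            given = dict(attrs)
            style["size"] = given.get("size") or style["size"]
            style["color"] = given.get("color") or style["color"]
        if tag in BOLD_TAGS:
            style["bold"] = True
        self.stack.append(style)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            style = self.stack[-1]
            self.items.append((tuple(style[k] for k in DISPLAY_ATTRS), text))


def group_by_display(html):
    """Group page text by identical display-attribute vectors."""
    parser = DisplayTextParser()
    parser.feed(html)
    parser.close()
    groups = defaultdict(list)
    for vector, text in parser.items:
        groups[vector].append(text)
    return dict(groups)


def classify(vector, labeled):
    """1-NN (an assumed classifier): label of the example with fewest mismatches."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(labeled, key=lambda example: distance(vector, example[0]))[1]


def extract(html, labeled, target="content"):
    """Return the text of every group whose vector is classified as `target`."""
    return [text
            for vector, texts in group_by_display(html).items()
            if classify(vector, labeled) == target
            for text in texts]


# Hypothetical page and hand-labeled training vectors.
page = ('<body><font size="5" color="red"><b>Headline</b></font><br>'
        '<p>Body one.</p><p>Body two.</p>'
        '<font size="1" color="gray">Footer</font></body>')
labeled = [
    (("b", "5", "red", True), "title"),
    (("p", "3", "black", False), "content"),
    (("font", "1", "gray", False), "noise"),
]
print(extract(page, labeled))  # ['Body one.', 'Body two.']
```

Because the vectors describe how text is displayed rather than where it sits in the markup, wrapping the same paragraphs in an extra table or div leaves their vectors, and so the extraction result, unchanged in this sketch. That is one plausible reading of the abstract's claim that the method depends little on page structure.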

. A web page information extraction method based on a classification algorithm [J]. Computer Science, 2008, 35(3): 91-93. https://doi.org/
