Computer Science, 2016, Vol. 43, Issue (Z11): 31-34. doi: 10.11896/j.issn.1002-137X.2016.11A.007

• Intelligent Computing •

Approach of Extracting Web Page Informational Content Based on Node Type Annotation

XIE Fang-li, ZHOU Guo-min and WANG Jian

  1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081
  • Published: 2018-12-01  Online: 2018-12-01
  • Supported by:
    This work was supported by the National High Technology Research and Development Program of China (2013AA102405)

Abstract: This paper proposes a method for extracting the informational content of web pages based on DOM node type annotation. According to the forms in which noise appears in web pages, DOM nodes are first classified into four types: text, image, anchor (link), and ignorable, and a method for calculating a node's degree of coherence (DoC) is given. Each DOM node is annotated with two new attributes, type and DoC. In the content extraction phase, text nodes whose DoC is greater than a threshold are selected and then merged to form the page's informational content. In a comparison with three other content extraction tools, the proposed method achieves an F1 score of 95.1%, which is 0.3% higher than the Evernote tool and 5.01% higher than the YNote tool.

Key words: DOM, node type annotation, informational content extraction
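
The pipeline described in the abstract (annotate each DOM node with a type and a DoC, then keep text nodes above a threshold) can be illustrated with a minimal Python sketch. The abstract gives neither the DoC formula nor the exact typing rules, so everything concrete here is an assumption made for illustration: DoC is approximated as the share of a node's text that lies outside links, a node counts as "anchor" when more than half of its text is link text, and the tag lists, the 0.8 threshold, and all function names are hypothetical.

```python
from html.parser import HTMLParser

IGNORABLE_TAGS = {"script", "style", "noscript", "iframe", "form"}  # assumed list
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []   # Node objects and text strings, in document order
        self.type = None     # "text", "image", "anchor" or "ignorable"
        self.doc = 0.0       # degree of coherence (assumed definition, see annotate)


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; not a full HTML5 tree builder."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not None and n.tag != tag:  # ignore stray closing tags
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.cur.children.append(text)


def text_len(node, in_link=False):
    """Return (total_chars, link_chars) for a subtree, skipping ignorable tags."""
    if node.tag in IGNORABLE_TAGS:
        return 0, 0
    in_link = in_link or node.tag == "a"
    total = link = 0
    for c in node.children:
        if isinstance(c, str):
            total += len(c)
            link += len(c) if in_link else 0
        else:
            t, l = text_len(c, in_link)
            total, link = total + t, link + l
    return total, link


def has_image(node):
    return node.tag == "img" or any(
        has_image(c) for c in node.children if isinstance(c, Node))


def annotate(node):
    """Attach the two attributes from the abstract: type and DoC."""
    total, link = text_len(node)
    if node.tag in IGNORABLE_TAGS or (total == 0 and not has_image(node)):
        node.type = "ignorable"
    elif total == 0:
        node.type = "image"
    elif link / total > 0.5:          # assumed rule for link-dominated nodes
        node.type = "anchor"
    else:
        node.type = "text"
    # Assumed stand-in for DoC: fraction of text that is not link text.
    node.doc = 0.0 if total == 0 else (total - link) / total
    for c in node.children:
        if isinstance(c, Node):
            annotate(c)


def subtree_text(node):
    if node.tag in IGNORABLE_TAGS:
        return ""
    parts = [c if isinstance(c, str) else subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract(node, threshold):
    """Keep text-type nodes with DoC above the threshold that hold text directly."""
    has_own_text = any(isinstance(c, str) for c in node.children)
    if node.type == "text" and node.doc > threshold and has_own_text:
        return [subtree_text(node)]
    out = []
    for c in node.children:
        if isinstance(c, Node):
            out += extract(c, threshold)
    return out


def extract_main_content(html, threshold=0.8):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    return "\n".join(extract(builder.root, threshold))
```

Under these assumptions, a navigation list made only of links is typed "anchor" and a script block is typed "ignorable", so neither reaches the output, while a paragraph with one short inline link stays above the threshold and is kept. The paper's actual DoC computation may differ substantially from this ratio.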

