Computer Science ›› 2016, Vol. 43 ›› Issue (Z11): 31-34. doi: 10.11896/j.issn.1002-137X.2016.11A.007


Approach of Extracting Web Page Informational Content Based on Node Type Annotation

XIE Fang-li, ZHOU Guo-min and WANG Jian   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: We propose an approach based on DOM node type annotation for extracting the informational content of web pages. Based on the noise patterns found in web pages, we first classify DOM nodes into four types (text, image, anchor and ignorance) and provide a method for calculating a node's degree of coherence (DoC). With the two new attributes, type and DoC, added to each DOM node, the content extraction phase selects the text nodes whose DoC exceeds a threshold and merges them into the page's informational content. In a comparison with three other content extraction tools, the proposed method achieves an F1 score of 95.1%, which is 0.3% higher than the Evernote tool and 5.01% higher than the YNote tool.

Key words: DOM, Node type annotation, Informational content extraction
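
The pipeline outlined in the abstract (annotate each DOM node with a type and a score, then keep text nodes above a threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the DoC formula, so the score below is a stand-in (the share of a subtree's characters that lie outside links), the set IGNORED_TAGS is a guess at what the "ignorance" type covers, and the names annotate, extract_content and PAGE are hypothetical. Annotations are kept in a side dictionary rather than written onto the nodes as attributes, and the input is assumed to be well-formed markup so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Assumption: tags treated as the "ignorance" type; the abstract does not list them.
IGNORED_TAGS = {"script", "style", "noscript", "form"}


def node_type(el):
    """Map an element to one of the four node types named in the abstract."""
    if el.tag in IGNORED_TAGS:
        return "ignorance"
    if el.tag == "a":
        return "anchor"
    if el.tag == "img":
        return "image"
    return "text"


def annotate(el, labels):
    """Record (type, score) for el in labels; return (plain_chars, linked_chars)."""
    kind = node_type(el)
    if kind == "ignorance":
        labels[el] = (kind, 0.0)
        return 0, 0  # the whole subtree is skipped
    plain = len((el.text or "").strip())
    linked = 0
    for child in el:
        child_plain, child_linked = annotate(child, labels)
        plain += child_plain + len((child.tail or "").strip())
        linked += child_linked
    if kind == "anchor":
        # Everything inside a link counts as link text.
        plain, linked = 0, plain + linked
    total = plain + linked
    # Stand-in score (NOT the paper's DoC): fraction of characters outside links.
    labels[el] = (kind, plain / total if total else 0.0)
    return plain, linked


def extract_content(markup, threshold=0.5):
    """Join the direct text of 'text' nodes whose score exceeds the threshold."""
    root = ET.fromstring(markup)
    labels = {}
    annotate(root, labels)
    parts = []
    for el in root.iter():  # document order
        kind, score = labels.get(el, ("ignorance", 0.0))
        if kind == "text" and score > threshold:
            own = [el.text] + [child.tail for child in el]
            parts.extend(t.strip() for t in own if t and t.strip())
    return " ".join(parts)


# Hypothetical sample page: a link bar, two content paragraphs, a script, a footer.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main"><p>Node type annotation marks each DOM node before extraction.</p>
<p>Nodes whose score exceeds the threshold are kept.</p></div>
<script>var x = 1;</script>
<div id="footer">Links: <a href="/about">About us and contact information</a></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_content(PAGE))
```

In this sketch the link bar and the footer score low because most of their characters sit inside anchors, and the script is dropped by type alone. Link text inside a kept paragraph is also omitted, which is a simplification; the threshold of 0.5 is arbitrary and would need tuning on real pages.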

