Computer Science (计算机科学), 2016, Vol. 43, Issue (Z11): 31-34. doi: 10.11896/j.issn.1002-137X.2016.11A.007
谢方立,周国民,王健
XIE Fang-li, ZHOU Guo-min and WANG Jian
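Abstract: This paper proposes a method for extracting the topic information of web pages based on DOM node type annotation. First, according to the forms in which noise appears in web pages, DOM nodes are divided into four types: text, image, link, and ignorable, and a method for computing node cohesion is given. Each DOM node is annotated with two attributes, its type and its cohesion. In the body extraction stage, text-type nodes whose cohesion exceeds a threshold are selected and then merged into the topic information of the page. In comparative experiments against three other web page body extraction tools, the method achieves an F1 score of 95.1%, which is 0.3% higher than the Evernote tool and 5.01% higher than the YNote tool.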
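The pipeline described in the abstract (annotate each DOM node with a type and a cohesion value, then keep text-type nodes above a threshold) can be illustrated with a minimal Python sketch. The abstract does not give the cohesion formula or the threshold, so everything specific here is an assumption made for illustration: cohesion is taken to be the share of a node's text that lies outside links, a node counts as link-type when more than half of its text is link text, the threshold defaults to 0.7, and the names `annotate`, `extract`, and `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed tag sets; the paper's actual rules are not stated in the abstract.
IGNORABLE_TAGS = {"script", "style", "noscript", "iframe", "form"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        self.node_type = None   # "text", "image", "link" or "ignorable"
        self.cohesion = 0.0


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree; text runs become '#text' child nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:       # ignore unmatched end tags
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))


def annotate(node, in_link=False):
    """Set node_type and cohesion; return (text_chars, link_chars, images)."""
    if node.tag in IGNORABLE_TAGS:
        node.node_type = "ignorable"
        return 0, 0, 0
    in_link = in_link or node.tag == "a"
    text = len(node.text)
    link = text if in_link else 0
    images = 1 if node.tag == "img" else 0
    for child in node.children:
        t, l, i = annotate(child, in_link)
        text, link, images = text + t, link + l, images + i
    if text == 0:
        node.node_type = "image" if images else "ignorable"
    else:
        node.node_type = "link" if link / text > 0.5 else "text"
        node.cohesion = (text - link) / text   # assumed definition of cohesion
    return text, link, images


def subtree_text(node):
    if node.node_type == "ignorable":
        return []
    parts = [node.text] if node.tag == "#text" else []
    for child in node.children:
        parts += subtree_text(child)
    return parts


def extract(node, threshold):
    """Collect text-type nodes that hold text directly and pass the threshold."""
    if node.node_type == "ignorable":
        return []
    has_own_text = any(c.tag == "#text" for c in node.children)
    if node.node_type == "text" and has_own_text and node.cohesion > threshold:
        return [" ".join(subtree_text(node))]
    blocks = []
    for child in node.children:
        blocks += extract(child, threshold)
    return blocks


def extract_main_text(html, threshold=0.7):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    return extract(builder.root, threshold)


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<script>var x = 1;</script>
<div id="main">
  <p>Rice yields rose sharply this year, according to the <a href="/r">annual report</a> released today.</p>
  <img src="field.jpg">
  <p>Researchers credit improved irrigation.</p>
</div>
<div id="footer"><a href="/about">About us</a></div>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_main_text(SAMPLE):
        print(block)
```

On the made-up `SAMPLE` page, the navigation bar and footer consist entirely of link text and the script is ignorable, so only the two paragraphs survive. A real implementation would follow the paper's own cohesion formula and tuned threshold rather than the stand-ins used here.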