计算机科学 ›› 2011, Vol. 38 ›› Issue (4): 213-215.

• 数据库与数据挖掘 • 上一篇    下一篇

逆序解析DOM树及网页正文信息提取

张瑞雪,宋明秋,公衍磊   

  1. (大连理工大学系统工程研究所 大连116023)
  • 出版日期:2018-11-16 发布日期:2018-11-16
  • 基金资助:
    本文受国家自然科学基金项目(70671016)资助。

Parsing DOM Tree Reversely and Extracting Web Main Page Information

ZHANG Rui-xue,SONG Ming-qiu,GONG Yan-lei   

  • Online:2018-11-16 Published:2018-11-16

摘要: 一般地,从HTML网页中提取正文信息,应先将HTML、网页解析成DOM树,然后遍历DOM树,依据目标信息在DOM树中的分布规律,将信息从DOM树中提取。这种传统方法将解析DOM树和从DOM树中提取信息看成两个独立的过程,制约了提取信息的速度。事实上,在准确提取目标信息的过程中,独立解析整个DOM树是没有必要的。在此,提出了逆序解析DOM树算法,并结合DOM树相似理论和传统的顺序解析算法,从部分目标信息开始分别向后顺序和向前逆序解析DOM树,同时定位并获取其他目标信息。利用该方法提取网页正文信息,一方面只需解析部分DOM树,从而减少了解析树结构花费的时间,另一方面不需要遍历整个DOM树查找目标信息,从而节省了查找时间,大大提高了信息提取速度。最后,通过实验证实了该方法的优越性。

关键词: DOM树,网页正文提取,结构相似性,逆序解析

Abstract: To extract main content from HTML Web page, generally, we should parse HTML, visit the whole DOM tree, and extract the data from the tree by distribution. However, this method separates the two processes of parsing and extracting and therefore restricts the speed. Actually, parsing the whole DOM tree is unnecessary. Here we supposed the algorithm of parsing DOM tree by reverse order. Then combining with the theory of DOM similarity and the traditional method of parsing DOM we parsed IWM tree with both normal order and reverse order, and at the same time we fixed the positions of other targots and got them. On the one hand, this method only parses part of DOM tree, so it reduces the time cost by parsing. On the other hand, we do not have to visit the whole tree to search the target information, as a result, it saves the searching time. Overall, this method improves the speed much. At the end of this paper, we gave the proof on the superiority of this method.

Key words: DOM tree, Web content extracting, Structural similarity, Parsing reversely

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!