Reverse-Order DOM Tree Parsing and Extraction of Web Page Main Content

Computer Science ›› 2011, Vol. 38 ›› Issue (4): 213-215.

Parsing DOM Tree Reversely and Extracting Web Main Page Information

ZHANG Rui-xue, SONG Ming-qiu, GONG Yan-lei

Online:2018-11-16 Published:2018-11-16

Abstract

To extract the main content from an HTML Web page, one generally parses the HTML, traverses the whole DOM tree, and then extracts the data from the tree according to its distribution. However, this approach separates parsing from extraction and therefore limits speed. In fact, parsing the whole DOM tree is unnecessary. Here we propose an algorithm that parses the DOM tree in reverse order. Combining it with the theory of DOM structural similarity and the traditional method of parsing the DOM, we parse the DOM tree in both normal and reverse order, and at the same time we locate the other targets and extract them. On the one hand, this method parses only part of the DOM tree, so it reduces the time spent on parsing. On the other hand, it does not have to visit the whole tree to search for the target information, which saves search time. Overall, the method substantially improves extraction speed. At the end of the paper, we give a proof of the superiority of this method.
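The idea of parsing only part of the tree from the end can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes, purely for illustration, that the target is the last element of a given tag in the page, that tag names are lowercase, and that attribute values contain no `<` characters. The function names `last_element_span` and `extract_last_block_text` are hypothetical. The structural-similarity step and the combined forward parse described in the abstract are not shown.

```python
from html.parser import HTMLParser


def _is_open(html, pos, tag):
    """True if an opening <tag ...> starts at index pos (lowercase tags assumed)."""
    head = "<" + tag
    if not html.startswith(head, pos):
        return False
    nxt = html[pos + len(head): pos + len(head) + 1]
    return nxt in (">", " ", "\t", "\n", "/")


def last_element_span(html, tag):
    """Scan backwards from the end of the string and return (start, end)
    of the last <tag>...</tag> element, or None if there is none."""
    close = "</%s>" % tag
    end = html.rfind(close)
    if end == -1:
        return None
    depth = 1          # one closing tag seen, its opening tag is still missing
    pos = end
    while depth > 0:
        pos = html.rfind("<", 0, pos)   # previous tag start, moving leftwards
        if pos == -1:
            return None                 # unbalanced markup
        if html.startswith(close, pos):
            depth += 1                  # a nested element closes here
        elif _is_open(html, pos, tag):
            depth -= 1                  # an element opens here
    return pos, end + len(close)


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes from the fragment it is fed."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_last_block_text(html, tag="div"):
    """Parse only the last <tag> element instead of the whole document."""
    span = last_element_span(html, tag)
    if span is None:
        return ""
    collector = _TextCollector()
    collector.feed(html[span[0]:span[1]])
    collector.close()
    return " ".join(collector.parts)


page = ("<html><body><div id='nav'>Menu</div>"
        "<div id='main'><p>Hello</p><div>inner</div> world</div>"
        "</body></html>")
print(extract_last_block_text(page))  # prints "Hello inner world"
```

In this sketch the navigation block at the start of the page is never examined: the backward scan stops as soon as the opening tag matching the last closing tag is found, and only that fragment is handed to the parser.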

Key words: DOM tree, Web content extracting, Structural similarity, Parsing reversely

ZHANG Rui-xue, SONG Ming-qiu, GONG Yan-lei. Parsing DOM Tree Reversely and Extracting Web Main Page Information[J]. Computer Science, 2011, 38(4): 213-215.