Computer Science ›› 2015, Vol. 42 ›› Issue (5): 109-113. doi: 10.11896/j.issn.1002-137X.2015.05.022


Research of Improved Tree Path Model in Web Page Clustering

WANG Ya-pu, WANG Zhi-jian and YE Feng   

  • Online: 2018-11-14 Published: 2018-11-14

Abstract: Computing similarity is the basis of text mining and a crucial step in information extraction. For Web pages with complex structure, similarity computed with the traditional tree path model is not accurate enough. The traditional model ignores the order of paths and compares paths by exact (perfect) matching, so it cannot describe the similarity between two paths accurately when they do not match exactly. This paper therefore first introduces the structural similarity of Web pages and then proposes an improved tree path model. The model takes full account of sibling relationships, path position, and path weights, and remedies the traditional model's inability to express both document structure and hierarchical information. Experimental results show that the model improves the ability to recognize structural similarity between Web pages: it better distinguishes Web pages with large structural differences, effectively reflects differences between Web pages built from the same template, and performs better in Web page clustering.

Key words: Information extraction, Web page structure, Similarity, Tree path model, Clustering
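
The limitation of the traditional tree path model described in the abstract can be illustrated with a minimal Python sketch. It assumes that a "path" is the root-to-node sequence of tag names and that page similarity is the Jaccard overlap of the two path sets; the abstract does not give the paper's formulas, so the names `PathCollector`, `tag_paths`, `exact_path_similarity`, and `prefix_overlap` are hypothetical. `prefix_overlap` is only one possible way to score a partial match and is not the improved model proposed in the paper.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-node tag path of every element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # one "a/b/c" string per element, in document order

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def exact_path_similarity(html_a, html_b):
    """Assumed traditional measure: Jaccard overlap of path sets.

    Path order is ignored and two paths count only if they are identical.
    """
    a, b = set(tag_paths(html_a)), set(tag_paths(html_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix_overlap(path_a, path_b):
    """Illustrative partial match: shared leading tags over the longer path length."""
    x, y = path_a.split("/"), path_b.split("/")
    shared = 0
    for s, t in zip(x, y):
        if s != t:
            break
        shared += 1
    return shared / max(len(x), len(y))


if __name__ == "__main__":
    page_a = "<html><body><div><p>a</p></div></body></html>"
    page_b = "<html><body><div><span>a</span></div></body></html>"
    print(exact_path_similarity(page_a, page_b))
    print(prefix_overlap("html/body/div/p", "html/body/div/span"))
```

Under these assumptions, two pages whose siblings appear in a different order receive the same score as identical pages, and two paths that differ only in their last tag contribute nothing under exact matching, although they share most of their structure. These are the two gaps the abstract attributes to the traditional model.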

