Computer Science ›› 2018, Vol. 45 ›› Issue (6A): 583-587.

• Interdiscipline & Application •

Novel Method of Web Page Segmentation Based on Title Machine Learning

LI Jin-sheng¹, LE Hui-xiao², TONG Ming-wen²

  1. Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033, China
  2. School of Education Information Technology, Central China Normal University, Wuhan 430079, China
  • Online: 2018-06-20 Published: 2018-08-03

Abstract: Web page segmentation methods based on the document object model (DOM) are difficult to implement. To address this problem, a novel method based on a string model is proposed. The features of web page titles are learned by machine learning, and the page is segmented according to the titles that are found. First, the titles in a web page are identified using the line block function and title tag information. Second, the page is partitioned into content blocks using the titles. Finally, the content blocks are merged according to block depth information. The algorithms in the method are shown to have O(n) complexity, and the method is suitable for web pages on university portals, blogs and resource websites. The method is useful for many applications in web page information management and has good application prospects.
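
The three steps named in the abstract (find titles, partition at titles, merge by depth) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so every name and threshold here is an assumption. In particular, the learned title features are replaced by a plain rule (a line carrying an `<h1>` to `<h6>` tag), the line block function is approximated as the summed text length of `k` consecutive lines, and the heading level stands in for block depth. The names `line_block_lengths`, `find_titles`, `segment_page`, `min_block_len` and `max_depth` are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")
HEADING = re.compile(r"<h([1-6])\b", re.IGNORECASE)

def strip_tags(line):
    """Remove markup from one source line and trim whitespace."""
    return TAG.sub("", line).strip()

def line_block_lengths(lines, k=3):
    """Assumed line block function: text length summed over k lines."""
    lengths = [len(strip_tags(ln)) for ln in lines]
    return [sum(lengths[i:i + k]) for i in range(len(lengths))]

def find_titles(lines, k=3, min_block_len=20):
    """Step 1: a heading line counts as a title if enough text follows it."""
    blocks = line_block_lengths(lines, k)
    titles = []
    for i, ln in enumerate(lines):
        match = HEADING.search(ln)
        following = blocks[i + 1] if i + 1 < len(lines) else 0
        if match and strip_tags(ln) and following >= min_block_len:
            titles.append((i, int(match.group(1))))  # (line index, depth)
    return titles

def segment_page(html, max_depth=2):
    """Steps 2 and 3: cut the page at titles, then merge deep blocks."""
    lines = html.splitlines()
    titles = find_titles(lines)
    blocks = []
    for n, (start, depth) in enumerate(titles):
        end = titles[n + 1][0] if n + 1 < len(titles) else len(lines)
        body = [strip_tags(ln) for ln in lines[start + 1:end] if strip_tags(ln)]
        if depth > max_depth and blocks:
            # A block deeper than max_depth is folded into its predecessor.
            blocks[-1]["body"].extend([strip_tags(lines[start])] + body)
        else:
            blocks.append({"title": strip_tags(lines[start]),
                           "depth": depth, "body": body})
    return blocks

# Made-up sample page, one element per source line.
html = """<h1>News</h1>
<p>The university opened a new library this week for all students.</p>
<h3>Details</h3>
<p>Opening hours are nine to five on weekdays and ten to four on weekends.</p>
<h1>Events</h1>
<p>A public lecture on web mining takes place on Friday afternoon.</p>"""

blocks = segment_page(html)
for block in blocks:
    print(block["title"], len(block["body"]))
```

Each function makes a single pass over the lines with a fixed window size, so the sketch runs in time linear in the number of lines, in the spirit of the O(n) complexity stated in the abstract.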

Key words: Block depth, Line block function, Machine learning, Title, Web page segmentation

CLC Number: TP37