Computer Science ›› 2018, Vol. 45 ›› Issue (6A): 583-587.

• Comprehensive, Interdisciplinary and Applied Topics •

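Novel Method of Web Page Segmentation Based on Title Machine Learning

LI Jin-sheng1, LE Hui-xiao2, TONG Ming-wen2

  1. Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033, China
  2. School of Education Information Technology, Central China Normal University, Wuhan 430079, China
  • Online: 2018-06-20 Published: 2018-08-03
  • About the authors: LI Jin-sheng (born 1965), male, professor; research interests: lifelong learning theory and network information management. LE Hui-xiao (born 1995), male, master's degree; research interests: web page information processing. TONG Ming-wen (born 1975), male, Ph.D., professor; research interests: adaptation of digital learning resources and educational resource management; E-mail: tmw@mail.ccnu.edu.cn (corresponding author).
  • Supported by:
    Humanities and Social Sciences Foundation of the Ministry of Education of China, project "Research on a Decision Model for Barrier-free Adaptation of Digital Learning Resources" (15YJA880062)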

Abstract: Existing web page segmentation methods are all based on the document object model (DOM) and are difficult to implement. To address this problem, a novel method that performs web page segmentation on a string data model is proposed. The method uses machine learning to obtain the features of web page titles and then uses the titles to segment the page. First, title features are learned from the line-block distribution function of the page and from title tags. Second, the page is partitioned into content blocks based on the titles. Finally, the content blocks are merged according to block depth, which completes the segmentation. Theoretical analysis and experimental results show that the algorithms in the method have O(n) time and space complexity, that the method segments pages from university portals, blogs and resource web sites well, and that it can be used in a variety of web page information management applications, giving it good application prospects.

Key words: Block depth, Line-block distribution function, Machine learning, Title, Web page segmentation

CLC Number:

  • TP37
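
The abstract names the steps of the method but not their details, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It treats the page as a plain string (no DOM tree), computes a line-block distribution in its commonly described form (for each line, the number of non-whitespace characters in a window of `k` consecutive tag-stripped lines), and splits the page at title tags. The function names, the window size `k=3`, and the fixed `title_tags` list are all assumptions; in the paper the title features are learned rather than hard-coded. Merging by block depth is left out because the abstract does not define block depth.

```python
import re


def _keep_newlines(match):
    # Replace removed markup with its own newlines so that
    # line numbers still map back to the source string.
    return "\n" * match.group(0).count("\n")


def visible_line_lengths(html):
    """Per-line count of non-whitespace characters after stripping markup."""
    # Drop script/style bodies, then every remaining tag (string model, no DOM).
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", _keep_newlines, html)
    text = re.sub(r"<[^>]*>", _keep_newlines, html)
    return [len(re.sub(r"\s+", "", line)) for line in text.split("\n")]


def line_block_distribution(lengths, k=3):
    """Sliding-window sum over k consecutive lines, computed in O(n)."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(lengths) < k:
        return []
    window = sum(lengths[:k])
    dist = [window]
    for i in range(k, len(lengths)):
        window += lengths[i] - lengths[i - k]  # slide the window by one line
        dist.append(window)
    return dist


def _clean(fragment):
    # Strip tags and collapse whitespace in a fragment of the page string.
    return " ".join(re.sub(r"<[^>]*>", " ", fragment).split())


def split_by_titles(html, title_tags=("h1", "h2", "h3")):
    """Split the page string into (title, content) blocks at title tags.

    title_tags is an assumed, fixed stand-in for learned title features.
    """
    names = "|".join(re.escape(tag) for tag in title_tags)
    pattern = re.compile(rf"(?is)<({names})\b[^>]*>(.*?)</\1\s*>")
    matches = list(pattern.finditer(html))
    blocks = []
    for cur, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(html)
        blocks.append((_clean(cur.group(2)), _clean(html[cur.end():end])))
    return blocks


if __name__ == "__main__":
    page = "<h2>News</h2><p>a b</p>\n<h2>Blog</h2><div>c</div>"
    print(line_block_distribution(visible_line_lengths(page), k=1))
    print(split_by_titles(page))
```

Each function makes a single pass over the page string or its lines, which is in keeping with the linear time and space complexity the abstract reports, although the regular-expression scans here are only an approximation of whatever string processing the paper actually uses.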