Computer Science ›› 2019, Vol. 46 ›› Issue (10): 63-70. doi: 10.11896/jsjkx.190200346

• Big Data & Data Science •

Multi-recording Complex Webpage Information Extraction Algorithm Based on Visual Block

WANG Wei-hong, LIANG Chao-kai, MIN Yong

  1. (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)
  • Received: 2019-02-23 Revised: 2019-05-19 Online: 2019-10-15 Published: 2019-10-21
  • Corresponding author: LIANG Chao-kai (born 1994), male, master's degree; main research interest: webpage information extraction. MIN Yong (born 1981), male, Ph.D., associate professor; main research interests: social network analysis and web data mining. E-mail: myong@zjut.edu.cn.
  • About the author: WANG Wei-hong (born 1969), male, master's degree, professor; main research interests: spatial information services, network technology and security.
  • Supported by:
    This work was supported by the Natural Science Foundation of Zhejiang Province (LY17G030030, LGF18D010001, LGF18D010002).

Abstract: Webpages carry rich content in complex and highly variable structures. Existing webpage information extraction techniques handle simple single-record pages, but they often perform poorly on complex multi-record pages. This paper proposes a new visual-block-based algorithm for automatic information extraction from complex webpages, named Visual Block Based Information Extraction (VBIE). It builds visual blocks and a visual block tree using heuristic rules, then extracts the data records of a complex webpage through region focusing, noise filtering and visual block filtering. The method drops the specific assumptions that earlier algorithms made about webpage structure, requires no manual labelling of the HTML document, preserves the original structure of the webpage, and performs unsupervised extraction on a single page. Experimental results show that VBIE's extraction precision reaches up to 100%, and that its average F1 is 98.5% on result pages of mainstream search engines and 96.1% on post pages of community forums. Compared with CMDR, which performs comparatively well on complex webpages among current methods, VBIE improves F1 by nearly 16.3%, indicating that the method can effectively handle information extraction from complex webpages.

Key words: Data record extraction, Structured information, Web data extraction, Web mining, Webpage data extraction
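
The abstract reports precision and F1 for extracted data records. As a reminder of how these metrics relate, the following is a minimal sketch, assuming records are compared as exact matches against hand-labelled ground truth. The function name `precision_recall_f1`, the record identifiers and the example counts are hypothetical and are not taken from the paper's evaluation.

```python
def precision_recall_f1(extracted, ground_truth):
    """Score extracted data records against labelled ones (exact-match, set semantics)."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: 10 labelled records; 9 blocks extracted, 8 of them correct.
truth = {f"record-{i}" for i in range(10)}
found = {f"record-{i}" for i in range(8)} | {"sidebar-ad"}
p, r, f1 = precision_recall_f1(found, truth)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a precision of 100% on a page does not by itself imply an F1 of 100%; recall must also be perfect.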

CLC Number: 

  • TP391