Computer Science (计算机科学), 2019, Vol. 46, Issue 10: 63-70. doi: 10.11896/jsjkx.190200346
王卫红, 梁朝凯, 闵勇
WANG Wei-hong, LIANG Chao-kai, MIN Yong
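Abstract: Web pages have rich content and complex, highly variable structures. Existing web information extraction techniques handle simple single-record pages, but often perform poorly on complex multi-record pages. This paper proposes a new visual-block-based algorithm for automatic information extraction from complex web pages, Visual Block Based Information Extraction (VBIE). It uses heuristic rules to build visual blocks and a visual block tree, then extracts the data records in a complex page through region focusing, noise filtering, and visual block screening. The method drops the specific assumptions about page structure that earlier algorithms relied on, requires no manual labeling of the HTML document, preserves the original structure of the page, and performs unsupervised extraction on a single page. Experimental results show that VBIE's extraction precision reaches up to 100%, with mean F1 scores of 98.5% on result pages of mainstream search engines and 96.1% on post pages of community forums. Compared with CMDR, one of the better-performing current methods on complex pages, VBIE improves the F1 score by nearly 16.3%, showing that the method can effectively solve the information extraction problem for complex web pages.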
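The abstract reports results as precision and F1. As a reminder of how these metrics relate for record extraction, the following minimal sketch computes precision, recall, and F1 from a set of extracted records and a set of ground-truth records. The function name `precision_recall_f1` and the sample records are hypothetical and chosen only for illustration; they are not taken from the paper's experiments or code.

```python
def precision_recall_f1(extracted, ground_truth):
    """Return (precision, recall, F1) for a set of extracted data records.

    Both arguments are iterables of hashable record identifiers.
    This is the standard definition, not code from the VBIE paper.
    """
    extracted = set(extracted)
    ground_truth = set(ground_truth)
    true_positives = len(extracted & ground_truth)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: 10 true data records; the extractor returns
# 9 of them plus one noise block (for example, an advertisement).
truth = {f"record-{i}" for i in range(10)}
found = {f"record-{i}" for i in range(9)} | {"ad-banner"}

p, r, f1 = precision_recall_f1(found, truth)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a precision of 100% alone does not imply a high F1; recall has to be high as well, which is why the abstract reports mean F1 separately from peak precision.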