Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 218-224. doi: 10.11896/jsjkx.210100230

• Big Data & Data Science •

Web Page Wrapper Adaptation Based on Feature Similarity Calculation

CHEN Ying-ren, GUO Ying-nan, GUO Xiang, NI Yi-tao, CHEN Xing

  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
    Fujian Key Laboratory of Network Computing and Intelligent Information Processing (Fuzhou University), Fuzhou 350108, China
  • Online: 2021-11-10 Published: 2021-11-12
  • Corresponding author: NI Yi-tao (yitao_ni@fzu.edu.cn)
  • Author e-mail: 2318191704@qq.com
  • About author: CHEN Ying-ren, born in 1997, postgraduate. His main research interests include software adaptation and knowledge mapping.
    NI Yi-tao, born in 1969, Ph.D, is a member of China Computer Federation. His main research interests include software engineering, system security and so on.
  • Supported by:
    National Key R&D Program of China (2017YFB1002000), Natural Science Foundation of Fujian Province for Distinguished Young Scholars (2020J06014) and Natural Science Foundation of Fujian Province (2018J07005).

Abstract: With the growth of big data, the volume of Internet data has exploded. The Web is an important information carrier that contains many types of information, and wrappers were proposed to extract target data from this cluttered content. However, Web pages are updated frequently, and even a minor structural change can cause an existing wrapper to fail, which raises wrapper maintenance costs. To address wrapper robustness and maintenance cost, this paper proposes a Web page wrapper adaptation technique based on feature similarity calculation. The technique parses the feature set of the new Web page and the feature information embodied in the old wrapper, computes Web page similarity to relocate the old wrapper's mapping region and mapping data items in the new page, and then uses this mapping to adapt the old wrapper to data extraction from the new page. Experiments cover several categories of Websites, including shopping, news, information, forum and service sites; 250 pairs of old and new page versions (500 Web pages in total) were selected for the wrapper adaptation experiments. The results show that when the page structure changes, the method effectively adapts to data extraction from the new page, with an average precision of 82.2% and an average recall of 84.36%.

Key words: Adaptation, Page features, Similarity calculation, Web page data extraction, Wrapper
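
The relocation idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the feature set (tag name, class set, tag path), the weights, the threshold, and the names `FeatureCollector`, `similarity` and `relocate` are all assumptions chosen for illustration. It only shows how features stored from an old page might be matched against the elements of a new page by similarity.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class FeatureCollector(HTMLParser):
    """Collect a small feature dict for every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.elements = []   # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = {
            "tag": tag,
            "classes": set((attrs.get("class") or "").split()),
            "path": [e["tag"] for e in self.stack] + [tag],
            "text": "",
        }
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach text to the innermost open element only.
        if self.stack and data.strip():
            self.stack[-1]["text"] += data.strip()


def collect(html):
    collector = FeatureCollector()
    collector.feed(html)
    return collector.elements


def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(old, new):
    # Weights are illustrative assumptions, not values from the paper.
    return (0.3 * (old["tag"] == new["tag"])
            + 0.4 * jaccard(old["classes"], new["classes"])
            + 0.3 * jaccard(old["path"], new["path"]))


def relocate(old_feature, new_html, threshold=0.5):
    """Return the most similar element of the new page, or None."""
    best = max(collect(new_html),
               key=lambda e: similarity(old_feature, e), default=None)
    if best is None or similarity(old_feature, best) < threshold:
        return None
    return best


old_html = ('<html><body><div class="item">'
            '<span class="price">$10</span></div></body></html>')
new_html = ('<html><body><section><div class="item card">'
            '<p class="title">Book</p>'
            '<span class="price now">$12</span></div></section></body></html>')

# Features of the data item that the old wrapper extracted.
old_price = next(e for e in collect(old_html) if "price" in e["classes"])
match = relocate(old_price, new_html)
print(match["tag"], match["text"])  # prints "span $12"
```

In this toy case the price element moves under a new `section` and gains a class, yet it still scores highest, so the old data item maps to the new `span`. When no element reaches the threshold, `relocate` returns `None`, which a caller could treat as a mapping failure.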

CLC Number:

  • TP311