Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 218-224. doi: 10.11896/jsjkx.210100230
陈迎仁, 郭莹楠, 郭享, 倪一涛, 陈星
CHEN Ying-ren, GUO Ying-nan, GUO Xiang, NI Yi-tao, CHEN Xing
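Abstract: With the development of big data, the volume of Internet data is growing explosively. The Web is an important information carrier that holds many types of information, and wrappers were proposed to extract target data from cluttered Web content. However, because web pages are updated frequently, even a slight structural change can cause an existing wrapper to fail, which raises wrapper maintenance costs. To address wrapper robustness and maintenance cost, this paper proposes a web page wrapper adaptation technique based on feature similarity computation. The technique parses the feature set of the new page and the feature information implied by the old wrapper, uses web page similarity computation to relocate the old wrapper's mapping region and mapped data items in the new page, and then uses this mapping so that the old wrapper adapts to data extraction on the new page. Experiments covered several types of websites, including shopping, news, information, forum, and service sites, from which 250 pairs of old and new page versions (500 pages in total) were selected for wrapper adaptation experiments. The results show that when page structure changes, the method effectively adapts to data extraction on the new page, with an average precision of 82.2% and an average recall of 84.36%.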
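The relocation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which features or which similarity measure the authors use, so everything here is an assumption made for illustration: the `Node` type, the string-encoded features (tag, class, label), the Jaccard measure, and the `threshold` value are all hypothetical and are not the paper's method.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Node:
    """A candidate node in the new page (hypothetical representation)."""
    path: str                 # location of the node in the new page
    features: FrozenSet[str]  # assumed feature encoding, e.g. "tag:span"


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two feature sets (an assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def relocate(old_features: FrozenSet[str],
             candidates: Iterable[Node],
             threshold: float = 0.5) -> Optional[Node]:
    """Return the new-page node most similar to the old wrapper's target.

    Returns None when no candidate reaches the (assumed) threshold,
    which signals that the old data item could not be mapped.
    """
    best: Optional[Node] = None
    best_score = -1.0
    for node in candidates:
        score = jaccard(old_features, node.features)
        if score > best_score:
            best, best_score = node, score
    if best is None or best_score < threshold:
        return None
    return best


# Features the old wrapper is assumed to have recorded for a price field.
old_price = frozenset({"tag:span", "class:price", "label:Price"})

# Candidate nodes parsed from the new version of the page (made-up data).
new_page = [
    Node("/html/body/div[1]/h1", frozenset({"tag:h1", "class:title"})),
    Node("/html/body/div[3]/span",
         frozenset({"tag:span", "class:price", "label:Price", "class:bold"})),
]

match = relocate(old_price, new_page)
```

In this made-up example the price field moved to a different `div` and gained an extra class, yet it still shares three of four features with the old record, so it is selected. A path-only wrapper would have failed on the same change.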