Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 218-224. doi: 10.11896/jsjkx.210100230

• Big Data & Data Science •

Web Page Wrapper Adaptation Based on Feature Similarity Calculation

CHEN Ying-ren, GUO Ying-nan, GUO Xiang, NI Yi-tao, CHEN Xing   

  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
    Fujian Key Laboratory of Network Computing and Intelligent Information Processing (Fuzhou University), Fuzhou 350108, China
  • Online: 2021-11-10 Published: 2021-11-12
  • About author: CHEN Ying-ren, born in 1997, postgraduate. His main research interests include software adaptation and knowledge mapping.
    NI Yi-tao, born in 1969, Ph.D., is a member of China Computer Federation. His main research interests include software engineering and system security.
  • Supported by:
    National Key R&D Program of China (2017YFB1002000), Natural Science Foundation of Fujian Province for Distinguished Young Scholars (2020J06014) and Natural Science Foundation of Fujian Province (2018J07005).

Abstract: With the growth of big data, the volume of Internet data has exploded, and the Web, as an important information carrier, contains many types of information. Wrappers were proposed to extract target data from this cluttered Web content. However, Web pages are updated frequently, and even minor structural changes can cause an existing wrapper to fail, which raises wrapper maintenance costs. To improve wrapper robustness and reduce maintenance cost, this paper proposes a Web page wrapper adaptation technique based on feature similarity calculation. The technique analyzes the feature set of the new Web page together with the feature information contained in the old wrapper, and calculates their similarity to relocate the old wrapper's mapping area and mapping data items in the new Web page. Based on the resulting mapping relationship, the old wrapper is adapted to extract data from the new page. The technique is evaluated on several types of websites, including shopping, news, information, forum and service sites. A total of 250 pairs of old and new page versions (500 Web pages) are selected for the wrapper adaptation experiments. The results show that when the page structure changes, the method adapts effectively to data extraction on the new page, with average precision and average recall of 82.2% and 84.36%, respectively.
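The relocation idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the paper's feature set, similarity formula, weights, or threshold, so everything here is an assumption made for illustration: the features (tag name, attribute tokens, text tokens), the weighted Jaccard score, the weights `(0.4, 0.3, 0.3)`, the threshold `0.5`, and the function names `similarity`, `relocate` and `precision_recall` are all hypothetical. The last function shows the standard set-based definition of precision and recall, which may differ in detail from the paper's evaluation protocol.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(old, new, weights=(0.4, 0.3, 0.3)):
    """Weighted feature similarity between an old wrapper item and a new node.

    Features and weights are illustrative assumptions, not the paper's.
    """
    w_tag, w_attr, w_text = weights
    tag_score = 1.0 if old["tag"] == new["tag"] else 0.0
    return (w_tag * tag_score
            + w_attr * jaccard(old["attrs"], new["attrs"])
            + w_text * jaccard(old["words"], new["words"]))


def relocate(old_item, candidates, threshold=0.5):
    """Return the most similar candidate node, or None if none reaches the threshold."""
    best = max(candidates, key=lambda c: similarity(old_item, c), default=None)
    if best is None or similarity(old_item, best) < threshold:
        return None
    return best


def precision_recall(extracted, expected):
    """Set-based precision and recall of extracted items against ground truth."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical data item recorded by the old wrapper.
    old_price = {"tag": "span", "attrs": {"class=price", "id=p1"}, "words": {"price"}}
    # Hypothetical candidate nodes taken from the updated page.
    new_nodes = [
        {"tag": "div", "attrs": {"class=title"}, "words": {"name"}},
        {"tag": "span", "attrs": {"class=price", "class=bold"}, "words": {"price"}},
    ]
    print(relocate(old_price, new_nodes))
    print(precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"}))
```

In this sketch a data item is remapped only when its best match clears the threshold, so an item that has disappeared from the new page is reported as missing rather than mapped to an unrelated node.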

Key words: Adaptation, Page features, Similarity calculation, Web page data extraction, Wrapper

CLC Number: TP311