Computer Science, 2019, Vol. 46, Issue (10): 63-70. doi: 10.11896/jsjkx.190200346

• Big Data & Data Science •

Multi-recording Complex Webpage Information Extraction Algorithm Based on Visual Block

WANG Wei-hong, LIANG Chao-kai, MIN Yong   

  1. (College of Computer Science and Technology,Zhejiang University of Technology,Hangzhou 310023,China)
  • Received:2019-02-23 Revised:2019-05-19 Online:2019-10-15 Published:2019-10-21

Abstract: Webpages are rich in content, and their structure is complex and varied. Existing webpage information extraction techniques handle simple single-record pages, but often perform poorly on complex multi-record pages. This paper proposes a new algorithm, visual block based information extraction (VBIE). VBIE constructs visual blocks and visual block trees, then applies heuristic rules, regional focus, noise filtering and visual block filtering to extract data records from complex webpages. Compared with existing methods, VBIE abandons the specific assumptions that earlier algorithms made about webpage structure, does not require manual annotation of the HTML document, preserves the original structure of the webpage, and performs unsupervised information extraction on a single page. Experimental results show that VBIE's extraction accuracy reaches up to 100%, and that its average F1 is 98.5% on the result pages of mainstream search engines and 96.1% on the post pages of community forums. Compared with the existing method CMDR, VBIE improves F1 by nearly 16.3%, which indicates that the method can effectively handle information extraction from complex webpages.
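The abstract names the main ingredients of VBIE (visual blocks, a visual block tree, noise filtering and visual block filtering) without giving the concrete rules. The following is only a minimal sketch of that general idea, not the authors' implementation: the `VisualBlock` structure, the `min_area` noise threshold, the `similar` rule (same left edge, similar width) and the `min_records` cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """A rendered rectangle on the page plus its child blocks (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int
    text: str = ""
    children: List["VisualBlock"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def has_text(block: VisualBlock) -> bool:
    """True if the block or any of its descendants carries visible text."""
    return bool(block.text.strip()) or any(has_text(c) for c in block.children)


def is_noise(block: VisualBlock, min_area: int = 400) -> bool:
    """Assumed noise rule: drop blocks that are tiny or contain no text at all."""
    return block.area < min_area or not has_text(block)


def similar(a: VisualBlock, b: VisualBlock, tol: float = 0.2) -> bool:
    """Assumed similarity rule: same left edge and widths within a tolerance."""
    return a.x == b.x and abs(a.width - b.width) <= tol * max(a.width, b.width)


def find_records(root: VisualBlock, min_records: int = 3) -> List[VisualBlock]:
    """Return the longest run of visually similar sibling blocks in the tree."""
    best: List[VisualBlock] = []
    stack = [root]
    while stack:
        node = stack.pop()
        # Noise filtering: ignore children that fail the assumed noise rule.
        kids = [c for c in node.children if not is_noise(c)]
        run: List[VisualBlock] = []
        for kid in kids:
            # Extend the current run while siblings look alike, else start over.
            if run and similar(run[-1], kid):
                run.append(kid)
            else:
                run = [kid]
            if len(run) >= min_records and len(run) > len(best):
                best = list(run)
        stack.extend(kids)
    return best
```

In this sketch a search result page would be modelled as a root block whose container child holds several equally wide, left-aligned result blocks; `find_records(root)` then returns those blocks as candidate data records, while headers and zero-area spacers are skipped.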

Key words: Data record extraction, Structured information, Web data extraction, Web mining, Webpage data extraction

CLC Number: TP391