Implementation Technology and Application of Web Crawler for Multi-data Sources

doi:10.11896/j.issn.1002-137X.2019.05.047

Abstract

Social computing methods based on big data technology are a research hotspot in academia, and obtaining the corresponding data resources from the Web is key to this research. At present, web crawler technology is the main method of collecting network data. Because existing crawler technology does not readily collect multi-source data, this paper proposes a web crawler data acquisition technology for multiple data sources. Building on six data collection crawlers for the media platforms Sina Weibo, People's Daily, Baidu Baike, Baidu Tieba, WeChat official accounts, and Eastmoney Stock Bar, the crawlers for the individual data sources are integrated through Servlet-based back-end scheduling, which solves the problem of collecting data from different media platforms. In the implementation, the Web application testing toolkit Selenium is first used to simulate manual actions such as logging in; the element query technology XPath is then used to parse the page source, extract the data, and store it in a database; finally, the data crawled from the multiple sources is read from the database and displayed on front-end web pages. Experiments show that the crawler maximizes acquisition efficiency while ensuring data integrity.
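The pipeline described in the abstract (dispatch to a per-source crawler, extract fields with XPath, store them in a database, read them back for display) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The page snippets, the table layout, and the names `crawl_weibo`, `crawl_tieba`, `CRAWLERS`, `collect` and `read_items` are all hypothetical. The simulated login step is omitted: the page source is supplied as a string, standing in for what a browser automation tool such as Selenium would return after logging in. A plain dictionary stands in for Servlet-based scheduling, and Python's standard-library XPath subset and SQLite stand in for the paper's unspecified parser and database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical page sources; a real crawler would obtain these from a
# browser driver after a simulated login.
WEIBO_PAGE = """<html><body>
<div class="post"><span class="author">user_a</span><p class="text">first post</p></div>
<div class="post"><span class="author">user_b</span><p class="text">second post</p></div>
</body></html>"""

TIEBA_PAGE = """<html><body>
<ul><li class="thread"><a class="title">thread one</a></li></ul>
</body></html>"""


def crawl_weibo(page_source):
    """Extract (source, author, content) rows using XPath-style queries."""
    root = ET.fromstring(page_source)
    rows = []
    for post in root.findall(".//div[@class='post']"):
        author = post.findtext("span[@class='author']")
        text = post.findtext("p[@class='text']")
        rows.append(("weibo", author, text))
    return rows


def crawl_tieba(page_source):
    """Extract thread titles; this hypothetical page has no author field."""
    root = ET.fromstring(page_source)
    titles = root.findall(".//li[@class='thread']/a[@class='title']")
    return [("tieba", None, a.text) for a in titles]


# Dispatcher table: one entry per data source (stand-in for the scheduler).
CRAWLERS = {
    "weibo": (crawl_weibo, WEIBO_PAGE),
    "tieba": (crawl_tieba, TIEBA_PAGE),
}


def make_db():
    """Create an in-memory database with a single shared table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (source TEXT, author TEXT, content TEXT)")
    return conn


def collect(source, conn):
    """Run the crawler registered for `source` and store its rows."""
    crawler, page = CRAWLERS[source]
    rows = crawler(page)
    conn.executemany(
        "INSERT INTO items (source, author, content) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


def read_items(conn, source):
    """Read stored rows back, as a display layer would."""
    cur = conn.execute(
        "SELECT author, content FROM items WHERE source = ? ORDER BY rowid",
        (source,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = make_db()
    for name in CRAWLERS:
        collect(name, db)
    print(read_items(db, "weibo"))
```

Keeping every source behind the same `collect` entry point means a further platform can be added by registering one more crawler function, which is the kind of integration the abstract attributes to the scheduling layer.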

Key words: Data acquisition, Data display, Information processing, Multiple data source, Network crawler

CLC Number: TP391.1

ZENG Jian-rong, ZHANG Yang-sen, ZHENG Jia, HUANG Gai-juan, CHEN Ruo-yu. Implementation Technology and Application of Web Crawler for Multi-data Sources [J]. Computer Science, 2019, 46(5): 304-309.
