Computer Science, 2019, Vol. 46, Issue (5): 304-309. doi: 10.11896/j.issn.1002-137X.2019.05.047

• Interdiscipline & Frontier •

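Implementation Technology and Application of Web Crawler for Multi-data Sources

ZENG Jian-rong, ZHANG Yang-sen, ZHENG Jia, HUANG Gai-juan, CHEN Ruo-yu

  1. (Institute of Intelligent Information, Beijing Information Science and Technology University, Beijing 100101, China)
  • Published: 2019-05-15
  • About the authors: ZENG Jian-rong (born 1995), male, master's student; main research interest: Chinese information processing. ZHANG Yang-sen (born 1962), male, postdoctorate, professor, CCF senior member; main research interests: Chinese information processing and artificial intelligence; E-mail: zhangyangsen@163.com (corresponding author). ZHENG Jia (born 1991), male, master's student; main research interests: Chinese information processing and sentiment analysis. HUANG Gai-juan (born 1964), female, senior experimentalist; main research interest: intelligent information processing. CHEN Ruo-yu (born 1982), male, Ph.D., lecturer; main research interest: natural language processing.
  • Funding:
    Supported by the National Natural Science Foundation of China (61772081, 61602044) and the Scientific Research Program of the Beijing Municipal Education Commission (KM201711232014).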

Abstract: Social computing methods based on big data technology are a current research hotspot in academia, and quickly obtaining the corresponding data resources from the Web is key to such research. Web crawlers are currently the main means of collecting Web data. To address the difficulty that existing crawlers have in collecting multi-source data, this paper proposes a Web crawler data collection technique for multiple data sources. Building on data collection crawlers developed for six types of media platform (Sina Weibo, People's Daily, Baidu Baike, Baidu Tieba, WeChat official accounts, and the Eastmoney Guba stock forum), it uses Servlet-based back-end scheduling to integrate the crawlers for the different sources, thereby solving the problem of collecting data from different media platforms. In the implementation, the Web application testing toolkit Selenium is first used to simulate manual operations such as logging in. XPath element queries are then used to parse the page source and extract the data, which is stored in a database. Finally, the crawled data is read from the database and displayed on front-end pages. Experiments show that the crawler maximizes collection efficiency while ensuring data integrity.

Key words: Data acquisition, Data display, Information processing, Multiple data sources, Web crawler

CLC Number:

  • TP391.1
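
The extract, store, and read-back steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the page sources are hard-coded strings standing in for what a Selenium session would return after a simulated login, the markup and the `RULES` table are hypothetical, the standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) stands in for a full XPath engine, and an in-memory SQLite database stands in for whatever database the paper used. The per-source rule table plays the role of the scheduling layer that dispatches each platform to its own extraction logic.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical per-source rules: source -> (item XPath, title XPath, body XPath).
# Real platforms would need their own rules, written against their actual markup.
RULES = {
    "weibo": (".//div[@class='post']", "./span[@class='author']", "./p"),
    "tieba": (".//li[@class='thread']", "./a", "./div[@class='text']"),
}

# Made-up, well-formed page sources standing in for browser output.
PAGES = {
    "weibo": (
        "<html><body>"
        "<div class='post'><span class='author'>alice</span><p>hello</p></div>"
        "<div class='post'><span class='author'>bob</span><p>world</p></div>"
        "</body></html>"
    ),
    "tieba": (
        "<html><body><ul>"
        "<li class='thread'><a>topic one</a><div class='text'>first reply</div></li>"
        "</ul></body></html>"
    ),
}


def extract(source, page_source):
    """Apply the XPath rules registered for `source` to one page."""
    item_xp, title_xp, body_xp = RULES[source]
    root = ET.fromstring(page_source)
    rows = []
    for item in root.findall(item_xp):
        title = item.findtext(title_xp, default="").strip()
        body = item.findtext(body_xp, default="").strip()
        rows.append((source, title, body))
    return rows


def crawl_all(conn, pages):
    """Dispatch every source to its rules and store the extracted rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, title TEXT, body TEXT)"
    )
    for source, page_source in pages.items():
        conn.executemany(
            "INSERT INTO records VALUES (?, ?, ?)", extract(source, page_source)
        )
    conn.commit()


def load_records(conn, source):
    """Read stored rows back, as a display layer would before rendering."""
    cur = conn.execute(
        "SELECT title, body FROM records WHERE source = ? ORDER BY rowid",
        (source,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    crawl_all(conn, PAGES)
    for source in PAGES:
        print(source, load_records(conn, source))
```

Keeping the extraction rules in a table keyed by source means a new platform can be added by registering one more entry, without touching the storage or display code. Real-world HTML is rarely well-formed XML, so a production crawler would likely need a tolerant HTML parser in place of `ET.fromstring`.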