Computer Science ›› 2019, Vol. 46 ›› Issue (5): 304-309. doi: 10.11896/j.issn.1002-137X.2019.05.047
曾健荣, 张仰森, 郑佳, 黄改娟, 陈若愚
ZENG Jian-rong, ZHANG Yang-sen, ZHENG Jia, HUANG Gai-juan, CHEN Ruo-yu
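Abstract: Social computing methods based on big data technology are currently a research focus in academia, and obtaining the relevant data resources from the web quickly is key to such research. Web crawling is currently the main means of collecting web data. To address the difficulty existing crawlers have in collecting data from multiple sources, this paper proposes a web crawler data acquisition technique for multiple data sources. Building on crawlers developed for six types of media platforms (Sina Weibo, People's Daily, Baidu Baike, Baidu Tieba, WeChat official accounts, and the Eastmoney Guba stock forum), it uses Servlet back-end scheduling to integrate the multi-source crawlers, which solves the problem of collecting data from different media platforms. In the implementation, the Web application testing toolkit Selenium is first used to simulate manual operations such as logging in. XPath element queries are then used to parse the page source and extract the data, which is stored in a database. Finally, the crawled data is read back from the database and displayed on a front-end page. Experiments show that the crawler maximizes collection efficiency while preserving data integrity.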
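The parse, store, and read-back stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper schedules its crawlers with Servlets and obtains page source through Selenium, whereas this sketch assumes the page source is already available as a string and uses only the Python standard library. `SAMPLE_PAGE`, `PARSERS`, `parse_posts`, `crawl_and_store`, `load_posts`, and the `posts` table are all hypothetical names chosen for illustration. Note also that `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, which real pages rarely are.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical page source standing in for what a browser session would
# return after a simulated login; real pages are rarely well-formed XML.
SAMPLE_PAGE = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="text">first post</p></div>
  <div class="post"><span class="author">bob</span><p class="text">second post</p></div>
</body></html>
"""


def parse_posts(page_source):
    """Extract (author, text) pairs using XPath-style element queries."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for post in root.findall(".//div[@class='post']"):
        author = post.findtext("span[@class='author']")
        text = post.findtext("p[@class='text']")
        rows.append((author, text))
    return rows


# Hypothetical registry: one parser per data source, selected by name.
# A multi-source crawler would register one entry per platform.
PARSERS = {"sample_forum": parse_posts}


def crawl_and_store(conn, source, page_source):
    """Parse one page for the named source and insert its rows into the database."""
    rows = PARSERS[source](page_source)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (source TEXT, author TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (source, author, text) VALUES (?, ?, ?)",
        [(source, author, text) for author, text in rows],
    )
    conn.commit()
    return len(rows)


def load_posts(conn, source):
    """Read stored rows back, as a front-end page would before displaying them."""
    cur = conn.execute(
        "SELECT author, text FROM posts WHERE source = ? ORDER BY rowid",
        (source,),
    )
    return cur.fetchall()
```

Calling `crawl_and_store(sqlite3.connect(":memory:"), "sample_forum", SAMPLE_PAGE)` parses the sample page and stores its two posts. Keeping one parser per source behind a shared registry is one simple way to let a single scheduler dispatch to platform-specific crawlers.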