Computer Science (计算机科学), 2018, Vol. 45, Issue (9): 60-64. doi: 10.11896/j.issn.1002-137X.2018.09.008

• 16th National Conference on Software and Applications •

Web-Based Lightweight Tool for Big Data Processing and Visualization

LI Yan1,2, MA Jun-ming1, AN Bo1,2, CAO Dong-gang1,2

  1. Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China
  2. School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, China
  • Received: 2017-08-15 Online: 2018-09-20 Published: 2018-10-10
  • Corresponding author: CAO Dong-gang (born 1975), male, associate professor. His main research interests include cloud computing and distributed computing. E-mail: caodg@pku.edu.cn
  • About the authors: LI Yan (born 1996), male, master's student; his main research area is computer software and theory. E-mail: 1400012951@pku.edu.cn. MA Jun-ming (born 1994), male, Ph.D. student; his main research area is software engineering. E-mail: mjm520@pku.edu.cn. AN Bo (born 1992), male, Ph.D. student; his main research area is computer software and theory. E-mail: anbo@pku.edu.cn
  • Funding:
    This work was supported by the National Key Research and Development Program of China (2016YFB1000105) and the National Natural Science Foundation of China (61690201, 61421091).

Abstract: Researchers routinely use tools such as Excel and SPSS to analyze and process data and to extract domain knowledge. With the arrival of the big data era, however, such general-purpose data processing software is limited by single-machine performance and can no longer meet researchers' needs for big data analysis and processing. Processing and visualizing big data requires a distributed computing environment. To process and visualize big data quickly, researchers must therefore not only purchase and maintain a distributed cluster, but also be able to program in a distributed environment and master the corresponding front-end data visualization techniques. For many data analysts without a computer science background, this is very difficult and unnecessary. To address these problems, this paper presents a Web-based lightweight tool for big data processing and visualization. With simple clicks and drags in a browser, data analysts can open large (GB-scale) data files, quickly jump to a given line of a file, invoke a distributed computing framework to sort the file contents or find maximum values, and visualize the data. An empirical study demonstrates the effectiveness of the solution.

Key words: Big data, Data analysis, Data visualization, Distributed system, Parallel computation

CLC Number:

  • TP399
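
The abstract mentions jumping quickly to a given line of a GB-scale file. The paper's actual implementation is not described on this page, so the following is only a minimal sketch of one common way to support such a jump: scan the file once to record the byte offset of every line start, then seek directly to the requested line. The function names `build_line_index` and `read_lines`, the chunk size, and the UTF-8 encoding are all assumptions made for illustration.

```python
import os
import tempfile


def build_line_index(path, chunk_size=1 << 20):
    """Scan the file once and return the byte offset where each line starts.

    Hypothetical helper for illustration; not taken from the paper.
    """
    offsets = [0]
    position = 0  # number of bytes consumed so far
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                newline = chunk.find(b"\n", start)
                if newline == -1:
                    break
                # The next line starts right after this newline byte.
                offsets.append(position + newline + 1)
                start = newline + 1
            position += len(chunk)
    # No line starts at end-of-file (empty file or trailing newline).
    if offsets[-1] == position:
        offsets.pop()
    return offsets


def read_lines(path, offsets, first_line, count):
    """Return up to `count` lines starting at 0-based line `first_line`."""
    if first_line < 0 or first_line >= len(offsets):
        return []
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offsets[first_line])  # jump without reading earlier lines
        for _ in range(count):
            raw = handle.readline()
            if not raw:
                break
            lines.append(raw.decode("utf-8").rstrip("\n"))
    return lines


if __name__ == "__main__":
    # Small stand-in for a large data file (assumed UTF-8, "\n" line endings).
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"".join(b"row-%d\n" % i for i in range(1000)))
        sample_path = tmp.name
    index = build_line_index(sample_path)
    print(read_lines(sample_path, index, 500, 3))
    os.remove(sample_path)
```

The index costs one full pass over the file but only one integer per line, so later jumps read just the lines that a browser view would actually display. Whether the tool itself works this way is not stated in the abstract.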