Computer Science (计算机科学), 2018, Vol. 45, Issue (9): 60-64. doi: 10.11896/j.issn.1002-137X.2018.09.008
李炎1,2, 马俊明1, 安博1,2, 曹东刚1,2
LI Yan1,2, MA Jun-ming1, AN Bo1,2, CAO Dong-gang1,2
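Abstract: Researchers routinely use tools such as Excel and SPSS to analyze and process data in order to gain knowledge of their fields. With the arrival of the big data era, however, common data processing software is constrained by single-machine performance and can no longer meet researchers' needs for analyzing and processing big data. Processing and visualizing big data depends on a distributed computing environment. To process and visualize big data quickly, researchers therefore need not only to purchase and maintain a distributed cluster, but also to be able to program in a distributed environment and to master the corresponding front-end data visualization techniques. For many data analysts without formal training in computer science, this is both very difficult and unnecessary. To address these problems, this paper proposes a lightweight Web-based tool for big data processing and visualization. With this tool, data analysts can, through simple clicking and dragging, easily open large data files (GB scale) in a browser, quickly navigate within a file (jump to a given line), conveniently invoke a distributed computing framework to sort the file contents or find the maximum value, and readily visualize the data. An empirical study shows that the solution is effective.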
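The abstract does not say how the tool implements "jump to a given line" in a GB-scale file. One common way to support it without loading the whole file is a byte-offset index, built in a single streaming pass. The sketch below is an illustration under that assumption, not the authors' implementation. The function names `build_line_index` and `read_lines` are hypothetical, and the input is assumed to be newline-delimited UTF-8 text.

```python
def build_line_index(path, chunk_size=1 << 20):
    """Return the byte offset at which each line of the file starts.

    The file is streamed in fixed-size chunks, so memory use depends on
    the number of lines (one integer each), not on the file size.
    """
    offsets = [0]
    position = 0  # number of bytes consumed so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                newline = chunk.find(b"\n", start)
                if newline == -1:
                    break
                # The next line begins right after this newline byte.
                offsets.append(position + newline + 1)
                start = newline + 1
            position += len(chunk)
    # An offset equal to the file size marks no real line (empty file,
    # or a file that ends with a trailing newline), so drop it.
    if offsets and offsets[-1] == position:
        offsets.pop()
    return offsets


def read_lines(path, offsets, first_line, count):
    """Read up to `count` lines starting at 1-based line `first_line`."""
    if first_line < 1 or first_line > len(offsets):
        return []
    lines = []
    with open(path, "rb") as f:
        # Seek directly to the requested line instead of scanning from 0.
        f.seek(offsets[first_line - 1])
        for _ in range(count):
            raw = f.readline()
            if not raw:
                break
            lines.append(raw.decode("utf-8").rstrip("\n"))
    return lines
```

In a Web setting, a server could build the index once when a file is opened and then answer each "go to line N" request with `read_lines(path, offsets, N, page_size)`, where `page_size` is the number of rows the browser displays at a time.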