Computer Science, 2021, Vol. 48, Issue (8): 1-12. doi: 10.11896/jsjkx.210600033

Special topic: Big Data & Data Science (virtual special topic)

• Database & Big Data & Data Science •

Data Science Platform: Features, Technologies and Trends

CHAO Le-men, WANG Rui

  1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China
  • Received: 2021-04-03  Revised: 2021-06-03  Published: 2021-08-10
  • Corresponding author: WANG Rui (wangrui1998@ruc.edu.cn)
  • About author: CHAO Le-men, born in 1979, Ph.D., associate professor, Ph.D. supervisor. His main research interests include data science and big data analysis (chaolemen@ruc.edu.cn). WANG Rui, born in 1998, postgraduate. Her main research interests include data science and big data analysis.
  • Supported by:
    National Natural Science Foundation of China (72074214).

Abstract: Using Gartner's annual Magic Quadrant reports on data science platforms since 2015 as a guide, this paper surveys 35 data science platform products and proposes a definition and a typology of data science platforms. The main scientific issues in academic research on data science platforms are platform design, scalability, data-lake-based platform development, support for team collaboration, openness strategy, and platform engineering methodology. The main features of data science platforms are modular development and integration capability, DevOps, and an emphasis on scalability, user experience, citizen data scientists, and human-machine collaboration scenarios. The key technologies for implementing them are machine learning, stream processing, tidy data, containerization, and data visualization. Future trends include integration with artificial intelligence, support for open source technology, attention to citizen data scientists, integration of data governance, adoption of data lakes, exploration of advanced analytics and applications, a shift toward the full data science pipeline, and diversification of application domains. Research and development of data science platforms should follow these design principles: centering on the activation of data value, human-in-the-loop design, DevOps, a balance between usability and explainability, cultivation of a data science product ecosystem, emphasis on user experience and ease of use, and integration with other business systems. At present, research and development of data science platforms urgently needs theoretical breakthroughs in data bias and fairness, robustness and stability, privacy protection, causal analysis, and trusted/responsible data science platforms.

Key words: Data science platform, Data scientist, DevOps, Explainability, Scalability

CLC number: TP391