Computer Science ›› 2021, Vol. 48 ›› Issue (8): 1-12.doi: 10.11896/jsjkx.210600033

Special Issue: Big Data & Data Science

• Database & Big Data & Data Science •

Data Science Platform: Features, Technologies and Trends

CHAO Le-men, WANG Rui   

  1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China
  • Received: 2021-04-03 Revised: 2021-06-03 Published: 2021-08-10
  • About author: CHAO Le-men, born in 1979, Ph.D., associate professor, Ph.D. supervisor. His main research interests include data science and big data analysis. (chaolemen@ruc.edu.cn) WANG Rui, born in 1998, postgraduate. Her main research interests include data science and big data analysis.
  • Supported by:
    National Natural Science Foundation of China (72074214).

Abstract: This paper proposes the concept and types of data science platforms, based on an in-depth study of more than 35 data science platforms covered in the annual Magic Quadrant for Data Science Platforms reports since 2015. The main scientific issues in academic research on data science platforms are platform design, scalability, research and development based on data lakes, support for team collaboration, open strategy, and engineering methodology. The main features of data science platforms include modular development and integration capability, DevOps, and an emphasis on scalability, user experience, citizen data scientists, and human-machine collaboration scenarios. The key technologies for realizing data science platforms are machine learning, stream processing, tidy data, containerization, and data visualization. Future development trends are mainly reflected in integration with artificial intelligence, support for open source technology, emphasis on citizen data scientists, integration of data governance, introduction of data lakes, exploration of advanced analysis and applications, a shift toward covering the whole data science pipeline, and diversification of application fields. Research and development of data science platforms should follow these design principles: centering on the activation of data value, human-in-the-loop, DevOps, balancing usability and explainability, cultivating a data science product ecosystem, emphasizing user experience and ease of use, and integrating with other business systems. At present, the research and development of data science platforms needs theoretical breakthroughs in data bias and fairness, robustness and stability, privacy protection, causal analysis, and trusted/responsible data science platforms.

Key words: Data science platform, Data scientist, DevOps, Explainability, Scalability

CLC Number: 

  • TP391