Computer Science (计算机科学), 2024, Vol. 51, Issue (1): 26-34. doi: 10.11896/jsjkx.231100121

• Special Topic for the 50th Anniversary of the Journal's Founding •

Exploring the Scientific Nature and Scientific Questions of Data Science

CHAO Lemen (朝乐门)

  1. Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education (Renmin University of China), Beijing 100872, China
    School of Information Resource Management, Renmin University of China, Beijing 100872, China
  • Online: 2024-01-15 Published: 2024-01-12
  • Corresponding author: CHAO Lemen (chaolemen@ruc.edu.cn)
  • About author: CHAO Lemen, born in 1979, Ph.D., professor, is a senior member of CCF (No. 50431S). His main research interests include data science and big data analysis.
  • Supported by:
    National Natural Science Foundation of China (72074214).

Abstract: Data science is an emerging discipline whose scientific nature has drawn scrutiny and whose scientific questions have not yet been clearly formulated. This paper examines the scientific nature of data science from four aspects: scientific research paradigms and methodology, falsifiability and reproducibility, scientific spirit and rapid iteration, and scientific research programme and theoretical system. It thereby answers the question of why data science is an emerging science. On this basis, drawing on the DIKW model (data-information-knowledge-wisdom pyramid or hierarchy), the DMP model (data-model-problem model), the statistical and machine learning methodologies of data science, and the processes and activities of data science, the paper proposes seven core scientific questions of data science: whether explanation comes first, comes later, or is absent; whether to align the problem with the data or the data with the problem; whether to place more trust in data or in models; whether to prioritize performance or interpretability; how to partition data; how to use known data to solve problems about unknown data; and whether to keep humans in the loop or out of it. Finally, four recommendations for data science research are offered: focus on theoretical research into data science itself; further separate and specialize the science, technology, and engineering of data; strengthen the theory and practice of data science empowered by artificial intelligence; and foster interaction between data science as a discipline and data science within a discipline.

Key words: Data science, Scientific nature, Scientific questions, DIKW model

CLC Number: TP391