Computer Science ›› 2024, Vol. 51 ›› Issue (1): 26-34. doi: 10.11896/jsjkx.231100121

• Special Issue on the 50th Anniversary of Computer Science

Exploring the Scientific Nature and Scientific Questions of Data Science

CHAO Lemen   

  1. Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872, China
    School of Information Resource Management, Renmin University of China, Beijing 100872, China
  • Online: 2024-01-15 Published: 2024-01-12
  • About author: CHAO Lemen, born in 1979, Ph.D., professor, is a senior member of CCF (No. 50431S). His main research interests include data science and big data analysis.
  • Supported by:
    National Natural Science Foundation of China (72074214).

Abstract: As an emerging academic field, data science has drawn attention to its scientific nature, yet its scientific questions have not been clearly defined. This paper explores the scientific nature of data science from four aspects: scientific research paradigms and methodologies, falsifiability and reproducibility, scientific spirit and rapid iteration, and scientific research agenda and theoretical framework. It also answers the question of why data science is an emerging science. Building on this foundation, and drawing on the DIKW model (the data-information-knowledge-wisdom pyramid or hierarchy), the DMP model (data-model-problem model), the statistical and machine learning methodologies of data science, and the processes and activities in data science, the paper presents seven core scientific questions in data science: whether explanation or data takes precedence, whether problems should be aligned with data or data with problems, whether to prioritize trust in data or in models, whether to emphasize performance or interpretability, data partitioning strategies, solving unknown data problems with known data, and whether humans are in or out of the loop. Finally, four recommendations for data science research are proposed: focusing on theoretical research within data science itself; further separating and specializing data science into science, technology, and engineering; strengthening the theory and practice of data science empowered by artificial intelligence; and fostering collaboration between the discipline of data science and data science within other disciplines.

Key words: Data science, Scientific nature, Scientific questions, DIKW model

CLC Number: 

  • TP391