Computer Science, 2018, Vol. 45, Issue (1): 1-13. doi: 10.11896/j.issn.1002-137X.2018.01.001

• Survey •

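Data Science Studies: State-of-the-art and Trends

CHAO Le-men, XING Chun-xiao and ZHANG Yong

  1. Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education, Renmin University of China, Beijing 100872; School of Information Resource Management, Renmin University of China, Beijing 100872
     Department of Computer Science and Technology, Tsinghua University, Beijing 100084; Research Institute of Information Technology, Tsinghua University, Beijing 100084; Tsinghua National Laboratory for Information Science and Technology (in preparation), Beijing 100084
  • Online: 2018-01-15 Published: 2018-11-13
  • Funding:
    This work was supported by the National Natural Science Foundation of China (91646202, 71103020) and the National Social Science Foundation of China (15BTQ054, 12&ZD220).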

Abstract: The arrival of the big data era has given rise to a new discipline: data science. First, this paper discusses basic issues of data science, including its definition, brief history, disciplinary status and body of knowledge, and identifies the differences and connections between domain-general data science and domain-specific data science. Second, it analyzes the characteristics of current data science research and presents the relatively hot topics in domain-general data science, in domain-specific data science and in the big data ecosystem. Third, it examines ten debates and challenges in data science research: the shift in thinking pattern (knowledge paradigm or data paradigm), the perspective on data (active or passive), the understanding of intelligence (better algorithms or more data), the main bottleneck (data-intensive or computing-intensive), data preparation (data preprocessing or data wrangling), quality of service (accuracy or user experience), data analysis (explanatory or predictive), algorithm evaluation (complexity or scalability), research paradigm (third paradigm or fourth paradigm), and talent cultivation (data engineers or data scientists). Fourth, it proposes ten trends in data science research: greater emphasis on predictive models and correlation analysis; the rise of model integration and meta-analysis; the emergence of the data-first, schema-later or schema-never approach; the return of data consistency and realism; the wide application of multi-copy techniques and the data locality principle; the coexistence of diversified technologies and integrated applications; the dominance of simple computing and pragmatism; data product development and embedded applications of data science; the rise of Pro-Am and citizen data science; and the discussion of data scientists and their education. Finally, based on this work, it offers several suggestions and caveats for data science researchers.

Key words: Data science, Big data, Data product development, Data wrangling, Data-driven

