Computer Science ›› 2023, Vol. 50 ›› Issue (5): 115-127. doi: 10.11896/jsjkx.220700042

• Database & Big Data & Data Science •

Dataspace: A New Data Organization and Management Model

FAN Shuhuan, HOU Mengshu

  1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
  • Received: 2022-07-04 Revised: 2022-11-09 Online: 2023-05-15 Published: 2023-05-06
  • Corresponding author: HOU Mengshu (mshou@uestc.edu.cn)
  • About author: FAN Shuhuan (fansh@uestc.edu.cn), born in 1988, Ph.D., is a member of China Computer Federation. Her main research interests include data management and big data analysis.
    HOU Mengshu, born in 1971, professor, Ph.D. supervisor, is a senior member of China Computer Federation. His main research interests include data management and natural language processing.
  • Supported by:
    National Key R&D Program of China (2019YFB1705601) and National Natural Science Foundation of China (62072075).

Abstract: With the rapid development of the digital economy, how to fuse multi-party data in untrusted environments, and thereby open new paths for data sharing, data analysis, and data services across organizations, has become a new problem in the digital upgrading of industry. Dataspaces offer a new approach to these problems. This paper reviews the history of data organization and management and argues that systematic research on dataspaces is urgent and important in the big data era. It analyzes the concept of a dataspace and gives a formal description, proposes a dataspace-based big data platform architecture, and summarizes three classic application scenarios. Focusing on dataspace construction, it analyzes the related research problems and main technical methods in data modeling, dynamic evolution, data query processing, and security and privacy, and briefly describes implementations and applications of dataspaces in different fields. Finally, it outlines research prospects and challenges in multimodal data fusion, efficient query processing, secure data sharing, and the construction and analysis of dataspace-based big data platforms.

Key words: Dataspace, Big data, Data sharing, Data modeling, Dynamic evolution, Data query, Security and privacy

CLC Number:

  • TP311