Computer Science, 2023, Vol. 50, Issue (5): 115-127. doi: 10.11896/jsjkx.220700042

• Database & Big Data & Data Science •

Dataspace: A New Data Organization and Management Model

FAN Shuhuan, HOU Mengshu   

  1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
  • Received: 2022-07-04 Revised: 2022-11-09 Online: 2023-05-15 Published: 2023-05-06
  • About author: FAN Shuhuan, born in 1988, Ph.D., is a member of China Computer Federation. Her main research interests include data management and big data analysis.
    HOU Mengshu, born in 1971, professor, Ph.D. supervisor, is a senior member of China Computer Federation. His main research interests include data management and natural language processing.
  • Supported by:
    National Key R&D Program of China (2019YFB1705601) and National Natural Science Foundation of China (62072075).

Abstract: With the rapid development of the digital economy, achieving multi-party data fusion in untrusted environments and finding new approaches to data sharing, data analysis, and data services in cross-organizational scenarios have become pressing problems in the digital upgrading of industry and society. Dataspace offers a new way to address these problems. This paper reviews the history of data organization and management and argues that, in the context of big data, systematic research on dataspace is both urgent and important. It analyzes the concept of dataspace and gives a formal description, proposes a dataspace-based big data platform architecture, and briefly describes three typical application scenarios. Focusing on dataspace construction, it surveys current research issues and the main technical methods in four areas: data modeling, dynamic evolution, data query processing, and security and privacy. It also briefly describes how dataspace has been implemented and applied in different fields. Finally, it discusses research prospects and challenges in multimodal data fusion, efficient query processing, secure data sharing, and the construction of dataspace-based big data platforms.

Key words: Dataspace, Big data, Data sharing, Data modeling, Dynamic evolution, Data query, Security and privacy

CLC Number: 

  • TP311