Computer Science (计算机科学), 2023, Vol. 50, Issue (5): 115-127. doi: 10.11896/jsjkx.220700042
范淑焕, 侯孟书
FAN Shuhuan, HOU Mengshu
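Abstract: With the rapid development of the digital economy, a new problem has emerged in the digital upgrading of industry: how to achieve multi-party data fusion in untrusted environments, and so find new approaches to data sharing, data analysis, and data services in cross-organizational scenarios. Dataspaces offer a new way of approaching these problems. This paper reviews the history of data organization and management, and argues that systematic research on dataspaces is both urgent and important in the context of big data. It analyzes the meaning of the dataspace concept and gives a formal description, proposes a dataspace-based big data platform architecture, and summarizes three classic application scenarios. Focusing on the construction of dataspaces, it analyzes the current related research problems and main technical methods from four perspectives: data modeling, dynamic evolution, data query processing, and security and privacy extensions. It also briefly describes how dataspaces have been implemented and applied in different domains. Finally, it discusses research prospects and challenges in multimodal data fusion, efficient query processing, secure data sharing, and the construction and analysis of dataspace-based big data platforms.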