Computer Science (计算机科学), 2022, Vol. 49, Issue (5): 120-128. doi: 10.11896/jsjkx.210300092

• Database & Big Data & Data Science •

Method on Multi-granularity Data Provenance for Data Fusion

YANG Fei-fei, SHEN Si-yu, SHEN De-rong, NIE Tie-zheng, KOU Yue

  1. College of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  • Received: 2021-03-09 Revised: 2021-10-22 Online: 2022-05-15 Published: 2022-05-06
  • Corresponding author: SHEN De-rong (shendr@mail.neu.edu.cn)
  • Author contact: 2164283712@qq.com
  • About author: YANG Fei-fei, born in 1998, postgraduate. Her main research interests include data integration and data provenance.
    SHEN De-rong, born in 1964, professor, Ph.D. supervisor, is a senior member of China Computer Federation. Her main research interests include Web data processing and distributed database.
  • Supported by:
    National Natural Science Foundation of China (62072084, 62072086) and National Key R&D Program of China (2018YFB1003404).

Abstract: As data volumes grow and datasets become increasingly correlated and interleaved, data fusion is needed to maximize the value of data. However, because the data fusion process is complex, a backtracking mechanism for data fusion is necessary to explain that process clearly. Although data provenance has been studied extensively, most existing work targets query and workflow provenance, and little of it addresses provenance for data fusion. This paper studies provenance for data fusion and proposes a method that supports multi-granularity data provenance. First, the data fusion process is abstracted: semantic graphs of schemas, entities and attributes are constructed with the entity as the core, which gives the fusion process explicit semantics, and an optimized storage model for provenance information is proposed. Second, based on the semantic graphs, provenance query algorithms at the entity level and at the attribute level are proposed, together with corresponding query optimization strategies. Finally, experiments demonstrate the effectiveness of the proposed data provenance method.

Key words: Data fusion, Data provenance, Multi-granularity
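
To make the idea of multi-granularity provenance concrete, the following is a minimal Python sketch and not the algorithm of the paper, whose details are not given on this page. It assumes a toy store in which each fused entity keeps edges back to the source records it was built from (entity level) and each fused attribute value keeps edges back to the source values that agree with it (attribute level). The names `ProvenanceGraph`, `record_fusion`, `entity_provenance`, `attribute_provenance` and `majority_vote` are all hypothetical, and majority voting stands in for whatever conflict-resolution rule a real fusion pipeline would use.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    """Toy provenance store (hypothetical): edges point from a fused item to its inputs."""
    # fused entity id -> set of (source name, source record id)
    entity_edges: dict = field(default_factory=dict)
    # (fused entity id, attribute) -> list of (source, record id, value)
    attribute_edges: dict = field(default_factory=dict)

    def record_fusion(self, fused_id, source_records, resolve):
        """Fuse source records into one entity and log provenance at both levels.

        source_records: list of (source, record_id, {attribute: value}).
        resolve: assumed conflict-resolution function that picks one value
                 from a list of candidate values.
        """
        fused = {}
        self.entity_edges[fused_id] = {(s, r) for s, r, _ in source_records}
        attributes = sorted({a for _, _, rec in source_records for a in rec})
        for attr in attributes:
            candidates = [(s, r, rec[attr]) for s, r, rec in source_records if attr in rec]
            chosen = resolve([v for _, _, v in candidates])
            fused[attr] = chosen
            # Attribute-level edges keep only inputs whose value equals the chosen one.
            self.attribute_edges[(fused_id, attr)] = [c for c in candidates if c[2] == chosen]
        return fused

    def entity_provenance(self, fused_id):
        """Entity-level query: which source records contributed to this entity?"""
        return sorted(self.entity_edges.get(fused_id, set()))

    def attribute_provenance(self, fused_id, attr):
        """Attribute-level query: which source values support this fused value?"""
        return list(self.attribute_edges.get((fused_id, attr), []))


def majority_vote(values):
    """Pick the most frequent value; ties go to the earliest-seen value."""
    return max(values, key=values.count)


graph = ProvenanceGraph()
fused = graph.record_fusion(
    "e1",
    [
        ("src_a", 1, {"name": "Alice", "city": "Shenyang"}),
        ("src_b", 7, {"name": "Alice", "city": "Beijing"}),
        ("src_c", 3, {"city": "Shenyang"}),
    ],
    majority_vote,
)
print(fused)
print(graph.entity_provenance("e1"))
print(graph.attribute_provenance("e1", "city"))
```

In this sketch the entity-level query returns all three source records, while the attribute-level query for `city` returns only the two records that supplied the winning value. Storing both edge sets separately is one simple way to answer coarse and fine queries without recomputing the fusion, at the cost of extra storage.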

CLC Number:

  • TP311.13