Computer Science, 2018, Vol. 45, Issue (3): 311-316. doi: 10.11896/j.issn.1002-137X.2018.03.051

• Interdisciplinary and Frontier •

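Rapid Decision Method for Repairing Sequence Based on CFDs

WANG Huan, ZHANG Yun-feng and ZHANG Yan

  1. Science and Technology Department, North China Institute of Aerospace Engineering, Langfang, Hebei 065000; School of Computer and Remote Sensing Information Technology, North China Institute of Aerospace Engineering, Langfang, Hebei 065000; School of Computer and Remote Sensing Information Technology, North China Institute of Aerospace Engineering, Langfang, Hebei 065000
  • Online: 2018-03-15 Published: 2018-11-13
  • Supported by:
    This work was supported by the Natural Science Foundation of Hebei Province (F2014409008), the Science and Technology Program of Hebei Province (17210336), and the Science and Technology Program of Langfang City (2017011042).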

Abstract: Data consistency is a central issue in big data quality management. Conditional functional dependencies (CFDs) are an effective technique for maintaining data consistency. However, the order in which CFDs are applied during repair affects both the accuracy and the efficiency of the repair, so selecting a correct and reasonable repairing sequence is critical. To address this problem, this paper presents a method for rapidly determining the repairing sequence based on CFD rules. First, a data repairing framework is designed. Then, by analyzing the associations among CFDs, the concept of a repairing sequence graph is introduced and used to compute the order in which CFDs are repaired. On the one hand, this avoids some incorrect or unnecessary repairs, which improves repair accuracy. On the other hand, determining the repairing sequence from the rules is faster than determining it from the actual data. In addition, repair deadlock is detected while the sequence is being determined, which guarantees that the repairing process terminates. Finally, comparative experiments against the existing method on two real-life datasets show that the proposed method is more accurate and runs more efficiently.

Key words: Data consistency, Conditional functional dependencies (CFDs), Repairing sequence
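
The abstract does not spell out how the repairing sequence graph is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes a simplified CFD with a set of left-hand-side attributes and a single right-hand-side attribute (the pattern tableau is omitted), and it assumes an ordering rule in which a CFD that may rewrite attribute X is repaired before any CFD that reads X. The names `CFD`, `build_sequence_graph`, and `repair_sequence` are hypothetical. The order is computed from the rules alone, without touching data, and any rules left unordered by a cycle are reported as a potential repair deadlock.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CFD:
    """Simplified CFD (assumption): LHS attributes determine one RHS attribute."""
    name: str
    lhs: frozenset
    rhs: str


def build_sequence_graph(cfds):
    """Add an edge a -> b when a's RHS attribute appears in b's LHS.

    Assumed ordering rule (not taken from the paper): the rule that may
    rewrite an attribute is repaired before rules that read that attribute.
    """
    edges = defaultdict(set)
    for a in cfds:
        for b in cfds:
            if a.name != b.name and a.rhs in b.lhs:
                edges[a.name].add(b.name)
    return edges


def repair_sequence(cfds):
    """Return (order, deadlocked) using Kahn's topological sort.

    `order` lists the rules that can be sequenced; `deadlocked` lists the
    rules that sit on, or downstream of, a dependency cycle.
    """
    edges = build_sequence_graph(cfds)
    indegree = {c.name: 0 for c in cfds}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    # Start from rules that no other rule can invalidate.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in sorted(edges[node]):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    deadlocked = sorted(n for n in indegree if n not in order)
    return order, deadlocked


if __name__ == "__main__":
    # Hypothetical rules over hypothetical attributes.
    rules = [
        CFD("phi1", frozenset({"zip"}), "city"),
        CFD("phi2", frozenset({"city"}), "state"),
        CFD("phi3", frozenset({"country", "state"}), "area_code"),
    ]
    print(repair_sequence(rules))
```

Kahn's algorithm is used here because it yields the order and exposes cycles in a single pass over the rule graph; a real system would also need to consult the pattern tableaux to decide whether two CFDs actually interact.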

