Computer Science ›› 2024, Vol. 51 ›› Issue (8): 124-132. doi: 10.11896/jsjkx.230900003

• Database & Big Data & Data Science •

Intelligent Evidence Set Selection Method for Diverse Data Cleaning Tasks

QIAN Zekai, DING Xiaoou, SUN Zhe, WANG Hongzhi, ZHANG Yan

  1. College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150006, China
  • Received: 2023-09-01 Revised: 2023-12-06 Online: 2024-08-15 Published: 2024-08-13
  • Corresponding author: WANG Hongzhi (wangzh@hit.edu.cn)
  • About author: QIAN Zekai (qzk010728@gmail.com), born in 2001, postgraduate, is a member of CCF (No.P8213G). His main research interests include data governance.
    WANG Hongzhi, born in 1978, Ph.D., professor, is a member of CCF (No.07132D). His main research interests include databases, big data management and analysis, and big data governance.
  • Supported by:
    National Key Research and Development Program of China (2021YFB3300502), National Natural Science Foundation of China (62232005, 62202126), China Postdoctoral Science Foundation (2022M720957) and Heilongjiang Postdoctoral Financial Assistance Program (LBH-Z21137).

Abstract: Data cleaning algorithms designed for a single, specific data quality issue do not always apply effectively when multiple cleaning requirements coexist, so several cleaning methods can be combined to address diverse cleaning needs. This paper formulates the data cleaning problem as one of evidence set generation and selection. Using an incremental quality assessment scheme based on aggregate queries and an operator result selection scheme based on intermediate operator evidence sets, it achieves efficient data cleaning in which multiple cleaning methods cooperate across multiple cleaning tasks. In the proposed cleaning model, the upstream operator library produces data cleaning results and converts them into intermediate operators. The midstream sampler partitions and prunes the set of intermediate operators to provide the searcher with high-quality candidate evidence sets. The downstream searcher selects evidence sets under the guidance of the quality evaluator; once the search completes, it sends updated data and the necessary parameters back to the upstream operator library, which then iterates again to generate intermediate operators. Finally, extensive experiments are conducted on three real-world datasets of different scales. Performance under different data cleaning tasks verifies the feasibility of operator orchestration for arbitrary kinds of cleaning requirements, and the proposed method is compared with existing intelligent data cleaning systems. The results show that the proposed method maintains stable precision and recall under multiple data quality constraints and in dynamic and large-scale data cleaning, and that, within the same cleaning time, its performance on outlier, rule violation, and mixed error cleaning tasks exceeds that of other intelligent data cleaning systems by more than 15%.

Key words: Data cleaning, Data quality assessment, Pipeline system design, Operator selection, Evidence set
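
The loop described in the abstract (operator library, sampler, searcher guided by a quality evaluator, then feedback to the operator library) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the `Evidence` class, the two toy operators (`clamp_range`, `impute_median`), the single `temp` column, the `[0, 50]` range rule, and the greedy search are all hypothetical. The evaluator here recounts violations over the whole table, whereas the paper describes an incremental assessment based on aggregate queries.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Evidence:
    """Hypothetical intermediate operator: one proposed repair for one cell."""
    row: int
    column: str
    value: float
    source: str  # name of the toy operator that proposed the repair


def violations(table, low=0.0, high=50.0):
    """Toy quality evaluator: count cells that are missing or out of range."""
    return sum(1 for r in table if r["temp"] is None or not low <= r["temp"] <= high)


def operator_library(table, low=0.0, high=50.0):
    """Run the toy cleaning operators and turn their proposals into evidence."""
    known = [r["temp"] for r in table if r["temp"] is not None and low <= r["temp"] <= high]
    if not known:
        return []
    fill = median(known)
    out = []
    for i, r in enumerate(table):
        v = r["temp"]
        if v is None:
            out.append(Evidence(i, "temp", fill, "impute_median"))
        elif not low <= v <= high:
            out.append(Evidence(i, "temp", min(max(v, low), high), "clamp_range"))
            out.append(Evidence(i, "temp", fill, "impute_median"))
    return out


def sampler(evidence, per_cell=1):
    """Prune candidates: keep at most `per_cell` proposals for each cell."""
    kept, seen = [], {}
    for e in evidence:
        key = (e.row, e.column)
        if seen.get(key, 0) < per_cell:
            kept.append(e)
            seen[key] = seen.get(key, 0) + 1
    return kept


def searcher(table, candidates):
    """Greedily accept each piece of evidence that lowers the violation count."""
    chosen = []
    for e in candidates:
        before = violations(table)
        old = table[e.row][e.column]
        table[e.row][e.column] = e.value
        if violations(table) < before:
            chosen.append(e)
        else:
            table[e.row][e.column] = old  # roll back an unhelpful repair
    return chosen


def clean(table, max_rounds=5):
    """Iterate: generate evidence, prune, select, then feed updated data back."""
    table = [dict(r) for r in table]  # work on a copy
    selected = []
    for _ in range(max_rounds):
        chosen = searcher(table, sampler(operator_library(table)))
        if not chosen:
            break
        selected.extend(chosen)
    return table, selected


dirty = [{"temp": 20.0}, {"temp": None}, {"temp": 300.0}, {"temp": 22.0}]
cleaned, selected = clean(dirty)
print(cleaned, [e.source for e in selected])
```

The sketch keeps the three roles as separate functions so that a different pruning rule or search strategy could be swapped in without touching the operators; a real system would also need richer constraints than a single range rule.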

CLC Number:

  • TP311