Computer Science ›› 2024, Vol. 51 ›› Issue (8): 124-132. doi: 10.11896/jsjkx.230900003
钱泽凯, 丁小欧, 孙哲, 王宏志, 张岩
QIAN Zekai, DING Xiaoou, SUN Zhe, WANG Hongzhi, ZHANG Yan
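Abstract: Data cleaning algorithms designed for a single, specific data quality problem do not always work well when several cleaning requirements coexist, so multiple cleaning methods can be combined to address them. This paper recasts data cleaning as the generation and selection of evidence sets. Using an incremental quality evaluation scheme based on aggregate queries and an operator-result selection scheme based on evidence sets of intermediate operators, it achieves efficient data cleaning in which multiple cleaning methods cooperate across a variety of cleaning tasks. In the proposed cleaning model, the upstream operator library produces data cleaning results and converts them into intermediate operators; the midstream sampler partitions and prunes the set of intermediate operators to give the searcher high-quality candidate evidence sets; the downstream searcher selects evidence sets under the guidance of the quality evaluator and, once the search finishes, sends the updated data and necessary parameters back to the upstream operator library so that it generates intermediate operators again in the next iteration. Finally, extensive experiments on three real-world datasets of different scales verify, through performance on different data cleaning tasks, the feasibility of operator orchestration under arbitrary kinds of cleaning requirements, and compare the proposed method with existing intelligent data cleaning systems. The results show that, across multiple cleaning tasks, the proposed method maintains stable precision and recall under multiple data quality constraints and in dynamic and large-scale data cleaning, and that, given the same cleaning time, it outperforms other intelligent data cleaning systems by more than 15% on cleaning tasks for outliers, rule violations, and mixed errors.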
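The operator library, sampler, searcher, and quality evaluator loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the names `Evidence`, `impute_missing_city`, `repair_fd_zip_city`, `sample`, `search`, and `clean` are hypothetical, the searcher is a simple greedy stand-in, and the quality evaluator recomputes an error count from scratch rather than maintaining it incrementally through aggregate queries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]

@dataclass(frozen=True)
class Evidence:
    """One proposed cell repair (hypothetical structure): table[row][column] = value."""
    row: int
    column: str
    value: object
    source: str  # name of the operator that proposed the repair

Operator = Callable[[Table], List[Evidence]]

def impute_missing_city(table: Table) -> List[Evidence]:
    """Toy operator: fill a missing city from a row with the same zip."""
    known = {r["zip"]: r["city"] for r in table if r["city"] is not None}
    return [Evidence(i, "city", known[r["zip"]], "impute")
            for i, r in enumerate(table)
            if r["city"] is None and r["zip"] in known]

def repair_fd_zip_city(table: Table) -> List[Evidence]:
    """Toy operator: enforce zip -> city with the majority city per zip."""
    counts: Dict[object, Dict[object, int]] = {}
    for r in table:
        if r["city"] is not None:
            per_zip = counts.setdefault(r["zip"], {})
            per_zip[r["city"]] = per_zip.get(r["city"], 0) + 1
    proposals = []
    for i, r in enumerate(table):
        if r["city"] is not None and r["zip"] in counts:
            best = max(counts[r["zip"]], key=counts[r["zip"]].get)
            if r["city"] != best:
                proposals.append(Evidence(i, "city", best, "fd"))
    return proposals

def error_count(table: Table) -> int:
    """Quality evaluator stand-in: missing cities plus zip -> city conflicts."""
    missing = sum(1 for r in table if r["city"] is None)
    cities: Dict[object, set] = {}
    for r in table:
        if r["city"] is not None:
            cities.setdefault(r["zip"], set()).add(r["city"])
    conflicts = sum(len(c) - 1 for c in cities.values())
    return missing + conflicts

def sample(candidates: List[Evidence], per_cell: int = 2) -> List[Evidence]:
    """Sampler stand-in: drop duplicate proposals and cap candidates per cell."""
    kept, seen = [], {}
    for ev in candidates:
        values = seen.setdefault((ev.row, ev.column), [])
        if ev.value not in values and len(values) < per_cell:
            values.append(ev.value)
            kept.append(ev)
    return kept

def search(table: Table, candidates: List[Evidence]) -> int:
    """Searcher stand-in: greedily keep each repair that lowers the error count."""
    current = error_count(table)
    for ev in candidates:
        old = table[ev.row][ev.column]
        table[ev.row][ev.column] = ev.value
        new = error_count(table)
        if new < current:
            current = new
        else:
            table[ev.row][ev.column] = old  # revert an unhelpful repair
    return current

def clean(table: Table, operators: List[Operator], max_rounds: int = 5) -> Table:
    """Iterate: operators propose, sampler prunes, searcher selects, data is fed back."""
    table = [dict(r) for r in table]  # work on a copy
    for _ in range(max_rounds):
        candidates = [ev for op in operators for ev in op(table)]
        before = error_count(table)
        after = search(table, sample(candidates))
        if after == before:  # no further improvement
            break
    return table

dirty = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Nwe York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": None},
]
cleaned = clean(dirty, [impute_missing_city, repair_fd_zip_city])
```

In this toy run, one operator handles a missing value and the other a rule violation, and the searcher accepts a proposed repair only when the evaluator's error count drops. This loosely mirrors the idea of selecting among the results of several cooperating cleaning methods.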