Computer Science ›› 2024, Vol. 51 ›› Issue (8): 124-132. doi: 10.11896/jsjkx.230900003

• Database & Big Data & Data Science •

Intelligent Evidence Set Selection Method for Diverse Data Cleaning Tasks

QIAN Zekai, DING Xiaoou, SUN Zhe, WANG Hongzhi, ZHANG Yan   

  1. College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150006, China
  • Received: 2023-09-01 Revised: 2023-12-06 Online: 2024-08-15 Published: 2024-08-13
  • About author: QIAN Zekai, born in 2001, postgraduate, is a member of CCF (No. P8213G). His main research interests include data governance.
    WANG Hongzhi, born in 1978, Ph.D., professor, is a member of CCF (No. 07132D). His main research interests include databases, big data management and analysis, and big data governance.
  • Supported by:
    National Key Research and Development Program of China (2021YFB3300502), National Natural Science Foundation of China (62232005, 62202126), China Postdoctoral Science Foundation (2022M720957) and Heilongjiang Postdoctoral Financial Assistance Program (LBH-Z21137).

Abstract: Data cleaning algorithms designed for a single type of data quality issue cannot effectively handle multiple coexisting quality requirements, so several cleaning methods can be combined to meet diverse cleaning needs. This paper formulates data cleaning as a problem of evidence set generation and selection. An incremental quality assessment scheme based on aggregate queries, together with an operator result selection scheme based on intermediate operator evidence sets, enables efficient cleaning that combines diverse methods across various cleaning tasks. In the proposed cleaning model, the operator repository produces data cleaning results and transforms them into intermediate operators. The sampler in the midstream module distributes and prunes the set of intermediate operators to provide the searcher with a high-quality candidate evidence set. The downstream searcher, guided by the quality evaluator, selects evidence sets. When the search completes, the upstream operator repository updates the data and necessary parameters so that intermediate operators can be generated again. Finally, extensive experiments are conducted on three real-world datasets of varying scales. Performance verification across different data cleaning tasks demonstrates the feasibility of operator orchestration for any type of data cleaning requirement, and shows that the proposed method maintains stable precision and recall under diverse data quality constraints, data dynamics, and large-scale data cleaning. Furthermore, a comparison with existing intelligent data cleaning systems shows that the proposed method outperforms these systems by over 15% in tasks involving outlier detection, rule violations, and mixed errors, within the same cleaning time.

Key words: Data cleaning, Data quality assessment, Pipeline system design, Operator selection, Evidence set

CLC Number: 

  • TP311
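
The generate, prune, select and update loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Evidence` record, the two toy operators, the violation-count evaluator and the greedy search are all assumptions made for illustration, and the evaluator recomputes its aggregate from scratch rather than incrementally as the paper proposes.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    """One candidate repair proposed by a cleaning operator (hypothetical structure)."""
    row: int
    column: str
    value: object
    operator: str

def range_operator(rows):
    """Toy outlier operator: propose replacing ages outside [0, 120] with the median valid age."""
    ages = sorted(r["age"] for r in rows if 0 <= r["age"] <= 120)
    if not ages:
        return []
    median = ages[len(ages) // 2]
    return [Evidence(i, "age", median, "range")
            for i, r in enumerate(rows) if not 0 <= r["age"] <= 120]

def fd_operator(rows):
    """Toy rule operator: for the dependency zip -> city, propose the majority city per zip."""
    by_zip = {}
    for r in rows:
        by_zip.setdefault(r["zip"], Counter())[r["city"]] += 1
    out = []
    for i, r in enumerate(rows):
        majority = by_zip[r["zip"]].most_common(1)[0][0]
        if r["city"] != majority:
            out.append(Evidence(i, "city", majority, "fd"))
    return out

def violations(rows):
    """Quality evaluator: an aggregate count of constraint violations (lower is better)."""
    bad_age = sum(1 for r in rows if not 0 <= r["age"] <= 120)
    cities = {}
    for r in rows:
        cities.setdefault(r["zip"], set()).add(r["city"])
    bad_fd = sum(1 for c in cities.values() if len(c) > 1)
    return bad_age + bad_fd

def sample(candidates, limit):
    """Sampler: drop duplicate proposals and cap the size of the candidate set."""
    return list(dict.fromkeys(candidates))[:limit]

def apply(rows, ev):
    """Return a copy of rows with one piece of evidence applied."""
    new_rows = [dict(r) for r in rows]
    new_rows[ev.row][ev.column] = ev.value
    return new_rows

def clean(rows, operators, limit=10, max_rounds=5):
    """Generate evidence, prune it, and greedily keep evidence that lowers the violation count."""
    selected = []
    for _ in range(max_rounds):
        candidates = sample([e for op in operators for e in op(rows)], limit)
        improved = False
        for ev in candidates:
            trial = apply(rows, ev)
            if violations(trial) < violations(rows):
                rows, improved = trial, True
                selected.append(ev)
        if not improved:
            break  # no operator can improve the data any further
    return rows, selected

rows = [
    {"zip": "150001", "city": "Harbin", "age": 30},
    {"zip": "150001", "city": "Harbin", "age": 250},  # out-of-range age
    {"zip": "150001", "city": "Harbn", "age": 41},    # violates zip -> city
]
cleaned, selected = clean(rows, [range_operator, fd_operator])
print(violations(rows), violations(cleaned))  # → 2 0
```

Each round re-runs the operators on the updated data, which mirrors the feedback from the searcher to the operator repository; a greedy acceptance rule is used here only because it is the simplest search strategy to show.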