Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241100156-8. doi: 10.11896/jsjkx.241100156

• Database & Big Data & Data Science •

Comparative Study of Missing Value Imputation Methods from Perspective of Interpretability

LI Yi1, WANG Tongxin2, PANG Bozhong1

  1 School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China
  2 School of Statistics, Shanxi University of Finance and Economics, Taiyuan 030006, China
  • Online: 2025-11-15 Published: 2025-11-10
  • Corresponding author: LI Yi (liyi@sxufe.edu.cn)
  • Supported by:
    Humanities and Social Sciences Research Planning Fund Project of the Ministry of Education (20YJA910004), National Statistical Science Research Program Project (2022LZ14), Key Project of Science and Technology Activities for Scholarly Exchange Students in Shanxi Province (20220025), Outstanding Youth Project of Shanxi Basic Research Program (202303021223010), and Major Project of Statistical Science Research in Shanxi Province (2024D005).

Abstract: With the widespread application of deep learning, high-quality tabular data is crucial for model predictive performance, yet missing values can severely disrupt the underlying structure and distribution of the data. Although numerous imputation methods exist, current research focuses predominantly on imputation accuracy and lacks a systematic evaluation of how imputation outcomes affect the interpretability of downstream models. This paper proposes a framework for evaluating missing value imputation methods from the perspective of model interpretability. First, it explores the advantages of deep generative models in learning complex data distributions to generate high-quality imputed values. Second, it constructs a variety of missing-data scenarios and uses the Shapley value as the core metric to quantitatively compare how different imputation methods affect the feature-importance explanations of a model. Experimental results show that: 1) deep generative models can effectively learn the sample distribution, and their imputed values excel at preserving data structure and informational integrity; 2) there is no direct correspondence between imputation accuracy and the stability of model explanations, and the choice of imputation method significantly alters the final Shapley values; 3) as the proportion of missing data increases, the differences in how imputation methods affect model explanations become more pronounced. This study reveals the latent impact of missing value imputation on model interpretability and provides empirical evidence and a new evaluation perspective for selecting imputation strategies in interpretability-critical scenarios.

Key words: Tabular data, Missing data, Deep generative model, Shapley value

CLC Number: TP311.13
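
The evaluation idea summarized in the abstract (mask values, impute them with different methods, then compare both imputation error and the resulting Shapley-based feature importance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the paper's setup: the data are synthetic, values are removed completely at random, simple mean and k-nearest-neighbour imputers stand in for the deep generative models studied in the paper, and the downstream model is a linear regression, for which the Shapley value of feature j (treating features as independent) has the closed form w_j * (x_j - mean_j). All function names are hypothetical.

```python
import numpy as np

def make_data(n=500, seed=0):
    """Synthetic tabular data (assumption): x1 is correlated with x0."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=n)
    x1 = 0.8 * x0 + 0.6 * rng.normal(size=n)
    x2 = rng.normal(size=n)
    X = np.column_stack([x0, x1, x2])
    y = 2.0 * x0 - 1.0 * x1 + 0.5 * x2 + 0.1 * rng.normal(size=n)
    return X, y

def mask_mcar(X, rate, seed=1):
    """Set each entry to NaN independently with probability `rate`."""
    rng = np.random.default_rng(seed)
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

def impute_mean(X_miss):
    """Replace each missing entry with its column mean."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)

def impute_knn(X_miss, k=5):
    """Simple kNN imputation: average the k nearest rows' observed values."""
    base = impute_mean(X_miss)          # used only to compute distances
    out = base.copy()
    missing = np.isnan(X_miss)
    for i in np.flatnonzero(missing.any(axis=1)):
        obs = ~missing[i]
        if not obs.any():
            continue                    # fully missing row keeps the mean fill
        d = np.sqrt(((base[:, obs] - base[i, obs]) ** 2).sum(axis=1))
        d[i] = np.inf                   # a row is not its own neighbour
        nn = np.argsort(d)[:k]
        for j in np.flatnonzero(missing[i]):
            donors = nn[~missing[nn, j]]
            if donors.size:
                out[i, j] = X_miss[donors, j].mean()
    return out

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (weights, bias)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def linear_shapley(X, w):
    """Shapley values of a linear model under feature independence."""
    return w * (X - X.mean(axis=0))

def mean_abs_shap(X, y):
    """Global importance: mean absolute Shapley value per feature."""
    w, _ = fit_linear(X, y)
    return np.abs(linear_shapley(X, w)).mean(axis=0)

def compare(rate=0.3):
    """Return {imputer: (RMSE on masked cells, total importance shift)}."""
    X, y = make_data()
    X_miss = mask_mcar(X, rate)
    mask = np.isnan(X_miss)
    ref = mean_abs_shap(X, y)           # importance on the complete data
    results = {}
    for name, imputer in (("mean", impute_mean), ("knn", impute_knn)):
        X_imp = imputer(X_miss)
        rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
        shift = np.abs(mean_abs_shap(X_imp, y) - ref).sum()
        results[name] = (rmse, shift)
    return results

if __name__ == "__main__":
    for rate in (0.1, 0.3, 0.5):
        print(rate, compare(rate))
```

Reporting the two numbers side by side for several missing rates makes it possible to check, on a toy problem, whether a lower imputation error goes together with a smaller change in feature importance. The sketch does not reproduce the paper's experiments or conclusions.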