Computer Science, 2025, Vol. 52, Issue (11A): 241100156-8. doi: 10.11896/jsjkx.241100156

• Big Data & Data Science •

Comparative Study of Missing Value Imputation Methods from Perspective of Interpretability

LI Yi1, WANG Tongxin2, PANG Bozhong1   

  1. School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China
  2. School of Statistics, Shanxi University of Finance and Economics, Taiyuan 030006, China
  • Online:2025-11-15 Published:2025-11-10
  • Supported by:
    Humanities and Social Sciences Research Planning Fund Project of the Ministry of Education (20YJA910004), National Statistical Science Research Program Project (2022LZ14), Key Project of Science and Technology Activities for Scholarly Exchange Students in Shanxi Province (20220025), Outstanding Youth Project of Shanxi Basic Research Program (202303021223010) and Major Project of Statistical Science Research in Shanxi Province (2024D005).

Abstract: With the widespread application of deep learning, high-quality tabular data is crucial for model performance. However, missing values can severely disrupt the underlying data structure and distribution. Although numerous imputation methods exist, current research focuses predominantly on imputation accuracy and lacks a systematic evaluation of how imputation outcomes affect the interpretability of downstream models. This paper proposes a framework for evaluating missing value imputation methods from the perspective of model interpretability. First, it explores the advantages of deep generative models in learning complex data distributions to generate high-quality imputed values. Next, it constructs various missing data scenarios and employs Shapley values as the core metric to quantitatively compare the impact of different imputation methods on the model's feature importance explanations. Experimental results demonstrate that: 1) Deep generative models can effectively learn the sample distribution and excel at preserving data structure and informational integrity. 2) There is no direct correlation between imputation accuracy and the stability of model explanations; the choice of imputation method significantly alters the final Shapley values. 3) As the proportion of missing data increases, the differences among imputation methods in their impact on model interpretability become more pronounced. This study reveals the latent impact of missing value imputation on model interpretability and provides empirical evidence and a new evaluation perspective for selecting appropriate imputation strategies in interpretability-critical scenarios.

Key words: Tabular data, Missing data, Deep generative model, Shapley value

CLC Number: 

  • TP311.13
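
The evaluation idea summarized in the abstract (mask values, impute them, retrain a downstream model, then compare Shapley-based feature importance) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic data, the MCAR masking, the two simple imputers (mean and single-variable regression, standing in for the deep generative models studied in the paper), the linear downstream model, the baseline-style Shapley value function, and every function name below are assumptions chosen only to keep the example self-contained.

```python
import itertools
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def make_data(n=200, seed=0):
    """Hypothetical synthetic table: x0 is correlated with x1, x2 is pure noise."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x0 = 0.8 * x1 + rng.gauss(0, 0.6)
        x2 = rng.gauss(0, 1)
        rows.append([x0, x1, x2])
        y.append(2.0 * x0 + 1.0 * x1 + rng.gauss(0, 0.1))
    return rows, y

def mask_mcar(rows, col, rate, seed=1):
    """Set a random fraction of one column to None (missing completely at random)."""
    rng = random.Random(seed)
    return [[None if (j == col and rng.random() < rate) else v
             for j, v in enumerate(r)] for r in rows]

def impute_mean(rows, col):
    """Fill missing entries of `col` with the mean of its observed entries."""
    m = mean(r[col] for r in rows if r[col] is not None)
    return [[m if v is None else v for v in r] for r in rows]

def impute_regression(rows, col, helper):
    """Fill `col` from a one-variable least-squares fit on the column `helper`."""
    obs = [(r[helper], r[col]) for r in rows if r[col] is not None]
    mx, my = mean(x for x, _ in obs), mean(t for _, t in obs)
    slope = (sum((x - mx) * (t - my) for x, t in obs)
             / sum((x - mx) ** 2 for x, _ in obs))
    return [[my + slope * (r[helper] - mx) if v is None else v for v in r]
            for r in rows]

def fit_linear(rows, y):
    """Ordinary least squares with intercept, solved via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for k in range(p):
            if k != i:
                f = A[k][i]
                A[k] = [vk - f * vi for vk, vi in zip(A[k], A[i])]
    coef = [A[i][p] for i in range(p)]
    return lambda x: coef[0] + sum(c * v for c, v in zip(coef[1:], x))

def shapley(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings,
    where features outside the coalition are held at their baseline value."""
    d = len(x)
    phi = [0.0] * d
    orders = list(itertools.permutations(range(d)))
    for order in orders:
        z = list(baseline)
        prev = model(z)
        for j in order:
            z[j] = x[j]
            cur = model(z)
            phi[j] += cur - prev
            prev = cur
    return [v / len(orders) for v in phi]

def global_importance(model, rows):
    """Mean absolute Shapley value per feature, with column means as baseline."""
    d = len(rows[0])
    baseline = [mean(r[j] for r in rows) for j in range(d)]
    per_row = [shapley(model, r, baseline) for r in rows]
    return [mean(abs(p[j]) for p in per_row) for j in range(d)]

rows, y = make_data()
masked = mask_mcar(rows, col=0, rate=0.4)

# Reference explanation: model trained on the complete table.
reference = global_importance(fit_linear(rows, y), rows)

# Retrain on each imputed table, then explain on the same complete rows
# (an assumption, so that differences stem only from the retrained model).
results = {
    "mean": global_importance(fit_linear(impute_mean(masked, 0), y), rows),
    "regression": global_importance(
        fit_linear(impute_regression(masked, 0, 1), y), rows),
}
for name, imp in results.items():
    shift = sum(abs(a - b) for a, b in zip(imp, reference))
    print(name, [round(v, 3) for v in imp], "shift vs. reference:", round(shift, 3))
```

The printed shift is the summed absolute difference between each imputed-data importance vector and the reference vector. Repeating the run with different `rate` values in `mask_mcar` is one simple way to probe how the missing proportion affects the gap between imputation methods.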