Computer Science (计算机科学), 2025, Vol. 52, Issue (11A): 241100156-8. doi: 10.11896/jsjkx.241100156
LI Yi1, WANG Tongxin2, PANG Bozhong1
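Abstract: With the widespread application of deep learning, high-quality tabular data is critical to model predictive performance, and missing data can severely damage its intrinsic structure and distribution. Although many missing-value imputation methods exist, prior work has focused mainly on imputation accuracy and lacks a systematic evaluation of how imputation results affect the interpretability of downstream models. This paper proposes an evaluation framework for missing-value imputation based on model interpretability. First, it examines the advantages of deep generative models in learning complex data distributions to generate high-quality imputed values. Second, it constructs multiple missing-data scenarios and uses the Shapley value as the core metric to quantitatively compare how different imputation methods affect the model's feature-importance explanations. The experimental results show that: 1) deep generative models learn the sample distribution effectively, and their imputed values perform well at preserving data structure and information integrity; 2) there is no direct correspondence between imputation accuracy and the stability of model explanations, and the choice of imputation method significantly changes the resulting Shapley values; 3) as the proportion of missing data increases, the differences in how imputation methods affect the explanations become more pronounced. This study reveals the potential impact of missing-value imputation on model interpretability and provides empirical evidence and a new evaluation perspective for choosing an imputation strategy in settings where interpretability matters.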
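The comparison described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the synthetic data, the least-squares linear model, the two simple imputers (mean and nearest-donor, standing in for the deep generative methods), the single mean baseline for the Shapley computation, and all function names are assumptions made for illustration only. The sketch masks one feature completely at random, imputes it two ways, refits the model, and compares the mean absolute Shapley value per feature against the complete-data case.

```python
import itertools
import random
import statistics


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to one baseline point.

    Features outside a coalition take their baseline value. All orderings
    are enumerated, so this is only practical for a handful of features.
    """
    perms = list(itertools.permutations(range(len(x))))
    phi = [0.0] * len(x)
    for order in perms:
        z = list(baseline)
        prev = f(z)
        for j in order:
            z[j] = x[j]
            cur = f(z)
            phi[j] += cur - prev  # marginal contribution of feature j
            prev = cur
    return [p / len(perms) for p in phi]


def fit_linear(X, y):
    """Ordinary least squares with intercept, solved by Gauss-Jordan."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for i in range(k):
            if i != c:
                m = A[i][c] / A[c][c]
                A[i] = [u - m * v for u, v in zip(A[i], A[c])]
                b[i] -= m * b[c]
    w = [b[i] / A[i][i] for i in range(k)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def make_data(n, rng):
    """Synthetic table: feature 0 is correlated with feature 1."""
    X, y = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x0 = 0.8 * x1 + 0.6 * rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        X.append([x0, x1, x2])
        y.append(2.0 * x0 + 1.0 * x1 + 0.5 * x2 + 0.1 * rng.gauss(0, 1))
    return X, y


def mask_mcar(X, col, rate, rng):
    """Set entries of column `col` to None completely at random."""
    return [[None if (j == col and rng.random() < rate) else v
             for j, v in enumerate(row)] for row in X]


def impute_mean(X, col):
    """Fill missing entries with the mean of the observed entries."""
    mu = statistics.fmean(r[col] for r in X if r[col] is not None)
    return [[mu if v is None else v for v in r] for r in X]


def impute_nearest(X, col):
    """Fill from the closest row (on the other columns) that has a value."""
    donors = [r for r in X if r[col] is not None]

    def dist(r, q):
        return sum((a - b) ** 2
                   for j, (a, b) in enumerate(zip(r, q)) if j != col)

    out = []
    for r in X:
        if r[col] is None:
            d = min(donors, key=lambda q: dist(r, q))
            r = [d[col] if j == col else v for j, v in enumerate(r)]
        out.append(list(r))
    return out


def mean_abs_shap(X, y):
    """Refit the model on X and return mean |Shapley value| per feature."""
    f = fit_linear(X, y)
    baseline = [statistics.fmean(col) for col in zip(*X)]
    total = [0.0] * len(X[0])
    for x in X:
        for j, p in enumerate(shapley_values(f, x, baseline)):
            total[j] += abs(p)
    return [t / len(X) for t in total]


rng = random.Random(0)
X_full, y = make_data(200, rng)
X_miss = mask_mcar(X_full, col=0, rate=0.3, rng=rng)
results = {
    "complete": mean_abs_shap(X_full, y),
    "mean": mean_abs_shap(impute_mean(X_miss, 0), y),
    "nearest": mean_abs_shap(impute_nearest(X_miss, 0), y),
}
for name, importance in results.items():
    print(name, [round(v, 3) for v in importance])
```

Varying `rate` in `mask_mcar` gives a rough way to explore how the gap between imputers changes with the missing proportion. A model with feature interactions, or a background-sample baseline instead of a single mean point, would need a different Shapley estimator than the one sketched here.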