Computer Science ›› 2011, Vol. 38 ›› Issue (5): 149-153.

• Databases and Data Mining •

Missing Data Imputation Based on Generalized Mahalanobis Distance

CHEN Huan, HUANG Der-cai

  1. (College of Computer Science, Zhejiang University of Technology, Hangzhou 310023)
  • Online: 2018-11-16 Published: 2018-11-16
  • Funding:
    This work was supported by the Natural Science Foundation of Zhejiang Province (Y105118).

Abstract: Missing data are inevitable in data collection, and how to restore them has become one of the most active research topics in data mining. Like many existing algorithms, missing data imputation algorithms based on the Mahalanobis distance make full use of the correlations among real data and impute well, but they require the covariance matrix of the data to be invertible, which greatly limits their range of application. Building on an improved version of the traditional principal component analysis (PCA) method, this paper uses singular value decomposition (SVD) and the properties of the Moore-Penrose generalized inverse to propose the concept of a generalized Mahalanobis distance. Applying this distance to a SOFM neural network and combining it with information entropy theory, we design GS, a missing data imputation algorithm based on the generalized Mahalanobis distance. Theoretical analysis and numerical simulation show that the generalized Mahalanobis distance fully inherits the advantages of the Mahalanobis distance in handling correlated data, and that the new algorithm not only achieves good imputation accuracy and stability but is also applicable to arbitrary data sets.

Key words: Principal component analysis (PCA), Moore-Penrose pseudoinverse, Generalized Mahalanobis distance, SOFM neural network, Information entropy
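
The abstract does not state the formula for the generalized Mahalanobis distance. As a minimal sketch, assuming it takes the common form d(x, y) = sqrt((x - y)^T Σ^+ (x - y)), where Σ^+ is the Moore-Penrose pseudoinverse of the sample covariance matrix, the idea could look like the following. The function name, the `rcond` cutoff, and the synthetic data are all hypothetical, and the paper's actual definition (which also involves a modified PCA step) may differ.

```python
import numpy as np

def generalized_mahalanobis(x, y, data, rcond=1e-10):
    """Distance between x and y using the pseudoinverse of the covariance.

    Assumption: Sigma^+ (Moore-Penrose pseudoinverse, which NumPy computes
    via SVD) stands in for Sigma^{-1}. This is an illustrative definition,
    not necessarily the one used in the paper.
    """
    sigma = np.cov(data, rowvar=False)              # columns are features
    sigma_pinv = np.linalg.pinv(sigma, rcond=rcond) # defined even if sigma is singular
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Clamp at zero to guard against tiny negative values from rounding.
    return float(np.sqrt(max(diff @ sigma_pinv @ diff, 0.0)))

# Hypothetical data: the third column is the sum of the first two,
# so the covariance matrix is singular and has no ordinary inverse.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
data = np.column_stack([base, base.sum(axis=1)])

d = generalized_mahalanobis(data[0], data[1], data)
```

When the covariance matrix is invertible, the pseudoinverse coincides with the ordinary inverse, so under this assumed definition the distance reduces to the standard Mahalanobis distance.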
