Hellinger Distance Based Similarity Analysis for Categorical Variables in Mixture Dataset

doi:10.11896/j.issn.1002-137X.2016.06.055

Abstract

Similarity analysis of categorical variables is an important part of data mining. Traditional methods neglect the differences between categorical variables, are seriously affected by unbalanced datasets, and cannot be used on mixed datasets. To overcome these shortcomings, this paper proposes an algorithm based on the Hellinger distance for measuring the similarity between categorical variables. As the similarity measure, it accumulates the differences between the distributions of the other attributes over the subsets corresponding to each categorical value, which makes it suitable for mixed datasets. Experiments that use the derived similarity metric in a clustering algorithm on UCI datasets show significant improvements in accuracy, validity, and stability.

Key words: Categorical variables, Similarity, f-divergence, Hellinger distance
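
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation; the abstract does not give the exact formulas, so several choices here are assumptions: the helper names (`hellinger`, `conditional_dist`, `bin_numeric`, `value_dissimilarity`), the reading of "mixture dataset" as a mix of numeric and categorical attributes, the equal-width binning of numeric attributes, and the unweighted sum over attributes. The sketch uses the standard discrete Hellinger distance, H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].

```python
import math
from collections import Counter


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2.0)


def conditional_dist(rows, attr, value, other):
    """Empirical distribution of `other` in the subset where rows[attr] == value."""
    counts = Counter(r[other] for r in rows if r[attr] == value)
    n = sum(counts.values())
    if n == 0:
        raise ValueError(f"value {value!r} does not occur in attribute {attr!r}")
    return {k: c / n for k, c in counts.items()}


def bin_numeric(rows, attr, bins=2):
    """Assumption: numeric attributes are discretized by equal-width binning."""
    values = [r[attr] for r in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    return [
        {**r, attr: min(bins - 1, int((r[attr] - lo) / width))} for r in rows
    ]


def value_dissimilarity(rows, attr, a, b):
    """Sum, over all other attributes, of the Hellinger distance between the
    distributions observed in the subsets attr == a and attr == b."""
    others = [k for k in rows[0] if k != attr]
    return sum(
        hellinger(
            conditional_dist(rows, attr, a, o),
            conditional_dist(rows, attr, b, o),
        )
        for o in others
    )


# Hypothetical toy data: two categorical attributes and one numeric attribute.
rows = [
    {"color": "red", "shape": "round", "weight": 1.0},
    {"color": "red", "shape": "round", "weight": 1.2},
    {"color": "blue", "shape": "square", "weight": 9.0},
    {"color": "blue", "shape": "square", "weight": 9.5},
    {"color": "green", "shape": "round", "weight": 1.1},
]
binned = bin_numeric(rows, "weight", bins=2)

d_red_green = value_dissimilarity(binned, "color", "red", "green")
d_red_blue = value_dissimilarity(binned, "color", "red", "blue")
print(d_red_green, d_red_blue)
```

In this toy data, "red" and "green" co-occur with the same shapes and weight bins, so their accumulated distance is zero, while "red" and "blue" have disjoint distributions on both attributes and reach the maximum of one per attribute. A dissimilarity of this kind could then replace simple matching inside a clustering routine such as k-modes.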

ZHAO Liang, LIU Jian-Hui and WANG Xing. Hellinger Distance Based Similarity Analysis for Categorical Variables in Mixture Dataset[J]. Computer Science, 2016, 43(6): 280-282.

