Computer Science, 2016, Vol. 43, Issue (6): 280-282. doi: 10.11896/j.issn.1002-137X.2016.06.055

• Artificial Intelligence •

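Hellinger Distance Based Similarity Analysis for Categorical Variables in Mixture Dataset

ZHAO Liang, LIU Jian-Hui and WANG Xing

  1. Graduate School, Liaoning Technical University, Fuxin 123000; School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125000; School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125000
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China project "Research on Key Technologies of Fuzzy Rule Interchange and Reasoning for the Semantic Web" (61402212).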

Abstract: Similarity analysis of categorical variables is an important step in data mining tasks. Existing similarity algorithms for categorical variables ignore the differences between variables, are strongly affected by imbalanced distributions, and cannot be applied to mixed datasets. To overcome these shortcomings, this paper proposes a similarity algorithm for categorical variables based on the Hellinger distance. The algorithm accumulates the differences between the distributions of the other attribute variables within the subsets corresponding to the categorical variables, uses the accumulated value as the similarity, and supports mixed datasets. The proposed measure was plugged into a clustering algorithm and applied to public UCI datasets; the results show considerable improvement in accuracy, validity, and stability.

Key words: Categorical variables, Similarity, f-divergence, Hellinger distance
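
The abstract gives only a verbal description of the method, so the following is a minimal sketch under stated assumptions rather than the authors' algorithm. It uses the standard Hellinger distance between two discrete distributions P and Q, H(P, Q) = sqrt(1/2 * sum_i (sqrt(p_i) - sqrt(q_i))^2), and sums it over the other attributes to compare two values of one categorical attribute. The function names (`hellinger`, `conditional_distribution`, `value_dissimilarity`), the dict-per-row data layout, and the restriction to categorical co-attributes are all illustrative assumptions; how the paper handles numeric attributes or weights the terms is not stated in the abstract and is not modeled here.

```python
from collections import Counter
from math import sqrt


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    `p` and `q` map outcomes to probabilities; missing outcomes count as 0.
    The result lies in [0, 1].
    """
    outcomes = set(p) | set(q)
    total = sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in outcomes)
    return sqrt(total / 2.0)


def conditional_distribution(rows, attr, value, other):
    """Empirical distribution of `other` in the subset where row[attr] == value."""
    subset = [row[other] for row in rows if row[attr] == value]
    if not subset:
        raise ValueError(f"value {value!r} does not occur in attribute {attr!r}")
    counts = Counter(subset)
    return {k: c / len(subset) for k, c in counts.items()}


def value_dissimilarity(rows, attr, a, b):
    """Sum of Hellinger distances, over all other attributes, between the
    subsets selected by attr == a and attr == b (assumed illustrative form)."""
    others = [name for name in rows[0] if name != attr]
    return sum(
        hellinger(
            conditional_distribution(rows, attr, a, other),
            conditional_distribution(rows, attr, b, other),
        )
        for other in others
    )


# Hypothetical toy data, not taken from the paper.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "green", "shape": "round"},
    {"color": "blue", "shape": "square"},
]
print(value_dissimilarity(rows, "color", "red", "green"))
print(value_dissimilarity(rows, "color", "red", "blue"))
```

In this sketch a larger value means the two categorical values are less alike, so it behaves as a dissimilarity; a clustering algorithm such as k-modes could consume it in place of simple matching. Because each Hellinger term is bounded by 1, no single co-attribute can dominate the sum.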

