Computer Science ›› 2016, Vol. 43 ›› Issue (1): 89-93. doi: 10.11896/j.issn.1002-137X.2016.01.021


Unsupervised Learning from Categorical Data: A Space Transformation Approach

WANG Jian-xin and QIAN Yu-hua   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Unsupervised learning from categorical data has become increasingly important in recent years in areas such as pattern recognition, machine learning, data mining, and knowledge discovery. Nevertheless, many existing clustering algorithms for categorical data, such as the classical k-modes algorithm, still leave considerable room for improvement when their performance is compared with that of clustering algorithms for numeric data. This gap may arise because categorical data lack the clear space structure that numeric data have. To effectively discover the space structure inherent in a set of categorical objects, we adopted a novel data representation scheme: a space transformation approach that maps a set of categorical objects into a corresponding Euclidean space whose new dimensions are constructed from each of the original features. Based on this new general framework for categorical clustering, we employed the K-modes algorithm of Carreira-Perpiñán and Wang to find more representative modes. The proposed method was tested on nine frequently used categorical data sets downloaded from the UCI repository. Comparisons with traditional clustering algorithms for categorical data illustrate the effectiveness of the new method on almost all of the data sets.
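
The idea of giving categorical objects a Euclidean representation can be illustrated with a minimal sketch. The abstract does not specify the transformation itself, so the code below substitutes plain one-hot encoding, in which each original feature contributes one block of 0/1 dimensions. This is an assumed stand-in for illustration, not the authors' method, and the function names and toy data are hypothetical. Under this encoding, the squared Euclidean distance between two objects equals twice the number of features on which they differ.

```python
from itertools import chain


def fit_categories(objects):
    """Collect the sorted distinct values of each feature (column)."""
    n_features = len(objects[0])
    return [sorted({obj[j] for obj in objects}) for j in range(n_features)]


def transform(obj, categories):
    """Map one categorical object to a 0/1 vector.

    Each original feature contributes one block of dimensions,
    one dimension per distinct value of that feature.
    """
    return list(chain.from_iterable(
        [1.0 if obj[j] == value else 0.0 for value in values]
        for j, values in enumerate(categories)
    ))


def hamming(a, b):
    """Simple matching dissimilarity: number of differing features."""
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Toy data (hypothetical): three objects described by three categorical features.
objects = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]

categories = fit_categories(objects)
vectors = [transform(obj, categories) for obj in objects]

for i in range(len(objects)):
    for j in range(i + 1, len(objects)):
        print(i, j, hamming(objects[i], objects[j]),
              squared_euclidean(vectors[i], vectors[j]))
```

Once the objects are numeric vectors, any clustering algorithm designed for Euclidean data can be applied to them. Which algorithm suits a given transformation is a separate design choice that this sketch does not address.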

Key words: Categorical data, Data representation scheme, Space transformation

