计算机科学 ›› 2016, Vol. 43 ›› Issue (1): 89-93, 121.doi: 10.11896/j.issn.1002-137X.2016.01.021

• 第五届全国智能信息处理学术会议 • 上一篇    下一篇

符号数据的无监督学习:一种空间变换方法

王建新,钱宇华   

  1. 山西大学计算机与信息技术学院 太原030006;计算智能与中文信息处理教育部重点实验室 太原030006,山西大学计算机与信息技术学院 太原030006;山西大学智能信息处理研究所 太原030006;计算智能与中文信息处理教育部重点实验室 太原030006
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受国家优秀青年基金项目(61322211),教育部新世纪人才支持计划(NCET-12-1031),教育部博士点专项科研基金项目(20121401110013),山西省青年学术带头人(20120301)资助

Unsupervised Learning from Categorical Data:A Space Transformation Approach

WANG Jian-xin and QIAN Yu-hua   

  • Online:2018-12-01 Published:2018-12-01

摘要: 近年来符号数据的无监督学习在模式识别、机器学习、数据挖掘和知识发现等诸多领域扮演着越来越重要的角色。然而现有的针对符号数据的聚类算法(经典的K-modes系列算法等),相比数值型数据的聚类算法,在性能方面仍然有很大的提升空间。其根本原因在于符号数据缺乏类似数值数据那样清晰的空间结构。为了能够有效地发掘符号数据内在的空间结构,采用了一种全新的数据表示方案:空间变换方法。该方法将符号数据映射到相应的由原来的属性组成的新的 维度的欧氏空间中。在这一框架的基础上,为了找到符号数据更有代表性的模式,结合Carreira-Perpin提出的K-modes算法进行无监督学习。在9个常用的UCI符号数据集上进行了测试,与传统的符号数据聚类算法进行了实验比较,结果表明几乎在所有的数据集上提出的方法都是更加有效的。

关键词: 符号数据,数据表示方案,空间变换

Abstract: The unsupervised learning method of categorical data plays a more and more important role in such areas as pattern recognition,machine learning,data mining and knowledge discovery in the recent years.Nevertheless,in view of many existing clustering algorithms for categorical data (the classical k-modes algorithm and so on),there is still a large room for improving their clustering performance in comparison with the performance of clustering algorithms for numeric data.This may arise from the fact that categorical data lack a clear space structure as that of numeric data.To effectively discover the space structure inherent in a set of categorical objects,we adopted a novel data representation scheme:a space transformation approach,which maps a set of categorical objects into a corresponding Euclidean space with the new dimensions constructed by each of the original features.Based on the new general framework for categorical clustering,we employed the Carreira-Perpin’s K-modes algorithm for clustering to find more representative modes.The performance of the new proposed method was tested on the nine frequently-used categorical data sets downloaded from the UCI.Comparisons with the traditional clustering algorithms for categorical data illustrate the effectiveness of the new method on almost all data sets.

Key words: Categorical data,Data representation scheme,Space transformation

[1] Qian Yu-hua,Li Fei-jiang,Liang Ji-ye,et al.Space structure and clustering of categorical data[J].IEEE Transactions on Neural Networks and Learning Systems,2015,99:1-13
[2] Carreira-Perpin M A,Wang Wei-ran.The K-modes algorithm for clustering[J].arXiv preprint arXiv:1304.6478,2013
[3] Huang Zhe-xue.A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining[M]∥Research Issues on Data Mining & Knowledge Discovery.1998:1-8
[4] Chan E Y,Ching W K,Ng M K,et al.An optimizationalgorithm for clustering using weighted dissimilaity measure[J].Pattern Recoginzation,2004,7(5):943-952
[5] Bai Liang,Liang Ji-ye,Dang Chuang-yin,et al.The impact ofcluster representatives on the convergence of the K-modes type clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(6):1509-1522
[6] Yang Yi-ming.An evaluation of statistical approaches to textcategorization[J].Information Retrieval,1999,1(1/2):69-90
[7] Information G M.Uncertainty and the utility of categories[C]∥Proc.of the Seventh Annual Conf.on Cognitive Science Society.Lawrence Erlbaum,1985:283-287
[8] Barbará D,Li Yi,Couto J.COOLCAT:an entropy-based algorithm for categorical clustering[C]∥Proceedings of the Ele-venth International Conference on Information and Knowledge Management.ACM,2002:582-589
[9] Aggarwal C C,Procopiuc C,Yu P S.Finding localized associations in market basket data[J].IEEE Transactions on Know-ledge and Data Engineering,2002,14(1):51-62
[10] Cao Fu-yuan,Liang Ji-ye,Bai Liang,et al.A framework for clustering categorical time-evolving data[J].IEEE Transactions on Fuzzy Systems,2010,18(5):872-882
[11] Wrigley N.Categorical data analysis for geographers and environmental scientists[M].Blackburn Press,2012
[12] Chmielewski M R,Grzymala-Busse J W.Global discretization of continuous attributes as preprocessing for machine learning[J].International Journal of Approximate Reasoning,1996,15(4):319-331
[13] Dash M,Liu Huan.Consistency-based search in feature selection[J].Artificial Intelligence,2003,151(1):155-176
[14] Guyon I,Elisseeff A.An introduction to variable and feature selection[J].The Journal of Machine Learning Research,2003,3:1157-1182
[15] Zhou Zhi-hua.Three perspectives of data mining[J].Artificial Intelligence,2003,3(1):139-146
[16] Huang Zhe-xue.Extensions to the k-means algorithm for clustering large data sets with categorical values[J].Data Mining and Knowledge Discovery,1998,2(3):283-304
[17] Lee M,Pedrycz W.The fuzzy C-means algorithm with fuzzy P-mode prototypes for clustering objects having mixed features[J].Fuzzy Sets and Systems,2009,160(24):3590-3600
[18] Yu Jian.General C-means clustering model[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(8):1197-1211
[19] Alamuri M,Surampudi B R,Negi A.A survey of distance/similarity measures for categorical data[C]∥2014 International Joint Conference on Neural Networks (IJCNN).IEEE,2014:1907-1914
[20] Andritsos P,Tsaparas P,Miller R J,et al.LIMBO:Scalable clustering of categorical data[M]∥Advances in Database Technology-EDBT 2004.Springer Berlin Heidelberg,2004:123-146
[21] Chan E Y,Ching W K,Ng M K,et al.An optimization algorithm for clustering using weighted dissimilarity measures[J].Pattern Recognition,2004,37(5):943-952
[22] Comaniciu D,Meer P.Mean shift:A robust approach toward feature space analysis[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(5):603-619

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75, 88 .
[2] 夏庆勋,庄毅. 一种基于局部性原理的远程验证机制[J]. 计算机科学, 2018, 45(4): 148 -151, 162 .
[3] 厉柏伸,李领治,孙涌,朱艳琴. 基于伪梯度提升决策树的内网防御算法[J]. 计算机科学, 2018, 45(4): 157 -162 .
[4] 王欢,张云峰,张艳. 一种基于CFDs规则的修复序列快速判定方法[J]. 计算机科学, 2018, 45(3): 311 -316 .
[5] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[6] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[7] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[8] 刘琴. 计算机取证过程中基于约束的数据质量问题研究[J]. 计算机科学, 2018, 45(4): 169 -172 .
[9] 钟菲,杨斌. 基于主成分分析网络的车牌检测方法[J]. 计算机科学, 2018, 45(3): 268 -273 .
[10] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99, 116 .