Computer Science, 2016, Vol. 43, Issue (9): 209-212. doi: 10.11896/j.issn.1002-137X.2016.09.041

• Artificial Intelligence •

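Clustering Ensemble Algorithm for Large-scale Mixed Data Based on Sampling

PANG Tian-jie, LIANG Ji-ye

  1. Department of Computer Science, Taiyuan Normal University, Taiyuan 030619; Department of Computer Science, Taiyuan Normal University, Taiyuan 030619; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006
  • Online: 2018-12-01  Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China project "Theory and Methods of Sparse Representation for 'User Behavior Data'" (61273294) and by the Shanxi Province research funding program for returned overseas scholars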

Abstract: Mixed data clustering is an important problem in cluster analysis. Existing mixed data clustering algorithms mainly cluster on the basis of similarity measures computed over all samples, so they are inefficient on large-scale data. To address this, a new sampling strategy is designed and, building on it, a sampling-based clustering ensemble algorithm for large-scale mixed data is proposed. The algorithm clusters each of the sample subsets obtained with the new sampling strategy separately and then integrates the results into the final clustering. Experiments show that, compared with the modified K-prototypes algorithm, the proposed algorithm is significantly more efficient while its clustering validity indexes remain essentially the same.

Key words: Clustering, Large-scale mixed data, Clustering ensemble, Sampling, Validity index
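
The abstract describes the pipeline only at a high level: draw several sample subsets, cluster each one, and integrate the results. The Python sketch below illustrates that general pattern and is not the authors' implementation. Several pieces are assumptions made for illustration: the subsets are drawn by plain uniform random sampling (the paper's new sampling strategy is not specified here), each subset is clustered by a small K-prototypes-style routine with farthest-point initialization, and the integration step simply re-clusters the pooled prototypes. The function names (`mixed_distance`, `k_prototypes`, `sampled_ensemble`) and the weight `gamma` are hypothetical.

```python
import random
from collections import Counter

def mixed_distance(x, proto, gamma=1.0):
    """K-prototypes-style dissimilarity: squared Euclidean distance on the
    numeric part plus gamma times the number of categorical mismatches."""
    (xn, xc), (pn, pc) = x, proto
    numeric = sum((a - b) ** 2 for a, b in zip(xn, pn))
    mismatches = sum(a != b for a, b in zip(xc, pc))
    return numeric + gamma * mismatches

def nearest(x, protos, gamma):
    """Index of the prototype closest to x."""
    return min(range(len(protos)),
               key=lambda j: mixed_distance(x, protos[j], gamma))

def k_prototypes(points, k, gamma=1.0, iters=10):
    """Small K-prototypes-like routine (assumed stand-in for the base clusterer)."""
    # Farthest-point initialization: start from the first point, then keep
    # adding the point farthest from its closest existing prototype.
    protos = [points[0]]
    while len(protos) < k:
        protos.append(max(
            points,
            key=lambda p: min(mixed_distance(p, q, gamma) for q in protos)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, protos, gamma)].append(p)
        new_protos = []
        for j, members in enumerate(groups):
            if not members:  # keep the old prototype for an empty cluster
                new_protos.append(protos[j])
                continue
            # Numeric attributes: mean; categorical attributes: mode.
            mean = tuple(sum(col) / len(col)
                         for col in zip(*(m[0] for m in members)))
            mode = tuple(Counter(col).most_common(1)[0][0]
                         for col in zip(*(m[1] for m in members)))
            new_protos.append((mean, mode))
        if new_protos == protos:
            break
        protos = new_protos
    return protos

def sampled_ensemble(data, k, n_subsets=5, subset_size=100, gamma=1.0, seed=0):
    """Cluster several random subsets, then integrate by re-clustering the
    pooled prototypes (an assumed, simple consensus step)."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_subsets):
        # Uniform sampling without replacement (assumption, see lead-in).
        subset = rng.sample(data, min(subset_size, len(data)))
        pooled.extend(k_prototypes(subset, k, gamma))
    final_protos = k_prototypes(pooled, k, gamma)
    # One pass over the full data set assigns every record to a final cluster.
    return [nearest(x, final_protos, gamma) for x in data]

# Each record is (numeric attributes, categorical attributes).
data = [((i * 0.01, 0.0), ("red",)) for i in range(20)] + \
       [((10 + i * 0.01, 10.0), ("blue",)) for i in range(20)]
labels = sampled_ensemble(data, k=2, n_subsets=4, subset_size=25)
```

In this sketch the expensive iterative clustering only ever runs on the subsets and on the small pool of prototypes, and the full data set is touched once for the final assignment. That is one plausible reading of why a sampling-based ensemble can be faster than clustering all samples directly.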

