基于密度峰值和网格的自动选定聚类中心算法

doi:10.11896/j.issn.1002-137X.2017.11A.085

摘要/Abstract

摘要： 针对快速搜索和发现密度峰值的聚类算法(DPC)中数据点之间计算复杂,最终聚类的中心个数需要通过决策图手动选取等问题,提出基于密度峰值和网格的自动选定聚类中心的改进算法GADPC。首先结合Clique网格聚类算法的思想,不再针对点对象进行操作,而是将点映射到网格,并将网格作为聚类对象,从而减少了DPC算法中对数据点之间的距离计算和聚类次数；其次通过改进后的聚类中心个数判定准则更精确地自动选定聚类中心个数；最后对网格边缘点和噪声点,采用网格内点对象和相邻网格间的相似度进行了处理。实验通过采用UEF(University of Eastern Finland)提供的数据挖掘使用的人工合成数据集和UCI自然数据集进行对比,其聚类评价指标(Rand Index)表明,改进的算法在计算大数据集时聚类质量不低于DPC和K-means算法,而且提高了DPC算法的处理效率。

关键词: 数据挖掘,聚类分析,密度峰值,网格,相似度

Abstract: Aiming at the shortcomings of clustering by fast search and find of density peaks algorithm(DPC),which calculates massive distance between point objects,has high computational-complexity about clustering process,and needs to select the final cluster centers manually,an improved algorithm that choose clustering centers automatically based on density peak and grid(GADPC) was proposed.Firstly,with the idea of Clique algorithm,all data points are mapped to grid clustering with grid objects rather than point objects,in order to reduce the distance computation and clustering complexity of DPC algorithm.Secondly,the decision accuracy of the number of cluster centers is improved so that it can automatically select cluster centers more precisely.Finally,the relative similarity between grid internal points and adjacent grid points is dealt,so that the edge points and noise points can be solved well.Comparing with machine learning synthetic data sets of UEF and UCI natural data sets,the rand index of those data sets shows that the clustering quality of the improved algorithm is not lower than DPC and K-means algorithm when calculating large data sets,and it improves the dealing efficiency of DPC algorithm.

Key words: Data mining,Clustering analysis,Density peak,Grid,Similarity

夏庆亚. 基于密度峰值和网格的自动选定聚类中心算法[J]. 计算机科学, 2017, 44(Z11): 403-406. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.085

XIA Qing-ya. Automatically Selecting Clustering Centers Algorithm Based on Density Peak and Grid[J]. Computer Science, 2017, 44(Z11): 403-406. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.085

参考文献

[1] ARABIE P,HUBERT LJ.An Overview of Combinatorial Data Analysis[M]∥Clustering and Classification.2003 :5-63.
[2] MICHALSKI R S,STEPP R E.Learning from Observation:Conceptual Clustering[M].Machine Learning:An Articial Intelligence Approach,1983:331-363.
[3] FUKUNAGE K.Introduction to Statistic Pattern Recognition[M].Academic Press,1990.
[4] QIAN W,ZHOU A.Analyzing popular clustering algorithmsfrom different viewpoints[J].Journal of Software,2002,13(8):1382-1394.
[5] YANG W,WANG T,LI J D.Clustering parameter selection algorithm based on density for divisional clustering process[J].Control & Decision,2016,31(1):21-29.
[6] LLOYD S.Least squares quantization in PCM[J].IEEE Tran-sactions on Information Theory,1982,28(2):129-137.
[7] 夏宁霞,苏一丹,覃希.一种高效的K-medoids聚类算法[J].计算机应用研究,2010,27(12):4517-4519.
[8] NG R T,HAN J.CLARANS:A Method for Clustering Objects for Spatial Data Mining[J].IEEE Transactions on Knowledge &Data Engineering,2002,14(5):1003-1016.
[9] ESTER B M,KRIEGEL H P,SANDER J.et al.A DensityBased algorithm for discovering clusters in large spatial data-bases[C]∥Proceedings of International Conference on know-ledge Discovery and Data Mining.AAAI,1996:226-231.
[10] SANDER J,ESTER M,KRIEGEL H P,et al.Density Based Clustering in Spatial Databases:The Algorithm GDBSCAN and Its Applications[J].Data Mining &Knowledge Discovery,1998,2(2):169-194.
[11] ANKERST M,BREUNIG M M,KRIEGEL H P,et al.OP-TICS:Ordering Points to Identify the Clustering Structure[J].Stanford Research Inst Memo Stanford University,1999,28(2):49-60.
[12] GUHA S,RASTOGI R,SHIM K.CURE:An Efficient Clustering Algorithm for Large Databases[C]∥Proc.of the ACM SIGMOD International Conference on Management of Data.1998:73-84.
[13] KARYPIS G,HAN E H,KUMAR V.CHAMELEON:A Hie-rarchical Clustering Algorithm Using Dynamic Modeling[J].IEEE Computer,1999,32(8):68-75.
[14] ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[J].AcmSigmod Record,1996,25(2):103-114.
[15] WANG W,YANG J,MUNTZ R R.STING:A Statistical Information Grid Approach to Spatial Data Mining[C]∥Proceedings of the 23rd International Conference on Very Large Data Bases.Morgan Kaufmann Publishers Inc.1997:186-195.
[16] SHEIKHOLESLAMI G,CHATTERJEE S, ZHANG A.WaveCluster:A Multi-Resolution Clustering Approach for Very Large Spatial Databases[C]∥International Conference on Very Large Data Bases.Morgan Kaufmann Publishers Inc.1998:428-439.
[17] AGRAWAL R,GEHRKE J E,GUNOPULOS D,et al.Auto-matic subspace clustering of high dimensional data for data-mining applications[M]∥ACM SIGMOD Record.ACM,1998:94-105.
[18] RODRIGUEZ A,LIAO A.Clustering by fast search and find of density peaks[J].Science,2014,344(6191):1492-1496.
[19] 何熊熊,管俊轶,叶宣佐,等.一种基于密度和网格的簇心可确定聚类算法[J].控制与决策,2016,7(5):913-919.
[20] GOIL S,NAGESH H,CHOUDHARY A.MAFIA:Ecient and Scalable Subspace Clustering for Very Large Data Sets[R].Technical Report,1999.
[21] MEHMOOD R,BIE R,DAWOOD H,et al.Fuzzy clustering by fast search and find of density peaks[C]∥International Confe-rence on Identification,Information,and Knowledge in the Internet of Things.IEEE,2016:258-261.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed