计算机科学 ›› 2021, Vol. 48 ›› Issue (10): 191-196.doi: 10.11896/jsjkx.200800191

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于密度峰值聚类的高斯混合模型算法

王卫东, 徐金慧, 张志峰, 杨习贝   

  1. 江苏科技大学计算机学院 江苏 镇江212100
  • 收稿日期:2020-08-28 修回日期:2020-11-30 出版日期:2021-10-15 发布日期:2021-10-18
  • 通讯作者: 王卫东(78653221@qq.com)
  • 基金资助:
    国家自然科学基金(61572242)

Gaussian Mixture Models Algorithm Based on Density Peaks Clustering

WANG Wei-dong, XU Jin-hui, ZHANG Zhi-feng, YANG Xi-bei   

  1. College of Computer Science,Jiangsu University of Science and Technology,Zhenjiang,Jiangsu 212100,China
  • Received:2020-08-28 Revised:2020-11-30 Online:2021-10-15 Published:2021-10-18
  • About author:WANG Wei-dong,Ph.D,associate professor.His main research interests include pattern recognition and intelligent information processing.
  • Supported by:
    National Natural Science Foundation of China(61572242).

摘要: 由于存在大量服从高斯分布的样本数据,采用高斯混合模型(Gaussian Mixture Models,GMM)对这些样本数据进行聚类分析,可以得到比较准确的聚类结果。通常采用EM算法(Expectation Maximization Algorithm)对GMM的参数进行迭代式估计。但传统EM算法存在两点不足:对初始聚类中心的取值比较敏感;迭代式参数估计的迭代终止条件是相邻两次估计参数的距离小于给定的阈值,这不能保证算法收敛于参数的最优值。为了弥补上述不足,提出采用密度峰值聚类(Density Peaks Clustering,DPC)来初始化EM算法,以提高算法的鲁棒性,采用相对熵作为EM算法的迭代终止条件,实现对GMM算法参数值的优化选取。在人工数据集及UCI数据集上的对比实验表明,所提算法不但提高了EM算法的鲁棒性,而且其聚类结果优于传统算法。尤其在服从高斯分布的数据集上的实验结果显示,所提算法大幅提高了聚类精度。

关键词: EM算法, 高斯混合模型, 聚类算法, 密度峰值聚类, 相对熵

Abstract: Due to the existence of a large number of sample data which obey the Gaussian distribution,GMM (Gaussian mixture models) is used to cluster these sample data and get more accurate clustering results.In general,EM algorithm(expectation maxi-mization algorithm) is used to estimate the parameters of GMM iteratively.However,the traditional EM algorithm has two shortcomings:it is sensitive to the initial clustering center;the itera-tive termination condition of iterative parameter estimation is to judge that the distance between two adjacent estimated parameters is less than a given threshold,which can't guarantee that the algorithm converges to the optimal value of the parameters.In order to overcome the above shortcomings,density peaks clustering (DPC) is proposed to initialize EM algorithm to improve the robustness of the algorithm.The relative entropy is used as the ite-ration termination condition of the EM algorithm to optimize the parameters of GMM algorithm.The comparative experiments on artificial datasets and UCI datasets show that the new algorithm not only improves the robustness of EM algorithm,but also outperforms the traditional clustering algorithm.On the datasets which obey Gaussian distribution,the new algorithm greatly improves the clustering accuracy.

Key words: Clustering algorithm, Density peaks clustering, Expectation maximization algorithm, Gaussian mixture models, Relative entropy

中图分类号: 

  • TP391.4
[1]BENABDELLAH A C,BENGHABRIT A,BOUHADDOU I.A survey of clustering algorithms for an industrial context[J].Procedia Computer Science,2019,148:291-302.
[2]OYELADE J,ISEWON I,OLADIPUPO O,et al.Data Clustering:Algorithms and Its Applications[C]//2019 19th International Conference on Computational Science and Its Applications (ICCSA).Saint Petersburg,Russia,2019:71-81.
[3]CHAI W Y,YANG F,YUAN S F,et al.Multi-class Gaussian Mixture Model and Neighborhood Information Based Gaussian Mixture Model for Image Segmentation[J].Computer Science,2018,45(11):272-277.
[4]ZOU C M,CHEN D.Unsupervised Anomaly Detection Method for High-dimensional Big Data Analysis[J].Computer Science,2021,48(2):121-127.
[5]MCQUEEN J.Some methods for classification and analysis of multivariate observations[C]//Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability.Los Angeles:University of California,1967:281-297.
[6]SHEN H,DUAN Z.Application Research of Clustering Algorithm Based on K-Means in Data Mining[C]//2020 Internatio-nal Conference on Computer Information and Big Data Applications (CIBDA).Guiyang,China,2020:66-69.
[7]SAPKOTA N,ALSADOON A,PRASAD P W C,et al.Data Summarization Using Clustering and Classification:Spectral Clustering Combined with k-Means Using NFPH[C]//2019 International Conference on Machine Learning,Big Data,Cloud and Parallel Computing (COMITCon).Faridabad,India,2019:146-151.
[8]ZAKIR H,NASIM A,AHMADR B,et al.A dynamic K-means clustering for data mining[J].Indonesian Journal of Electrical Engineering and Computer Science,2019,13(2):521-526.
[9]ALEX R,ALESSANDRO L.Clustering by fast search and find of density peaks[J].Science,2014,344:1492-1496.
[10]DEMPSTERA P,LAIRDN M,RUBIND B.Maximum Likeli-hood from Incomplete Data via the EM Algorithm[J].Journal of the Royal Statistical Society,1997,39(1):1-38.
[11]YANG M S,LAI C Y,LIN C Y.A robust EM clustering algorithm for Gaussian mixture models[J].Pattern Recognition,2012,45(11):3950-3961.
[12]YUE J,WANG S T.Algorithm EM and Its Initialization inGaussian Mixture Model Based Clustering[J].Microcomputer Information,2006,11(22):244-247.
[13]XIE J Y,GAO H C,XIE W X.K-nearest neighbors optimized clustering algorithm by fast search and finding the density peaks of a data set[J].Scientian Sinica Informationis,2016,46(2):258-280.
[14]WANG S L,WANG D K,LI C Y,et al.Clustering by fastsearch and find of density peaks with data field[J].Chinese Journal of Electronics,2016,3(25):397-402.
[15]ZHANG W K,LI J.Extended fast search clustering algorithm:widely density clusters,no density peaks[J].arXiv:1505.05160,2015.
[16]JI X,YAO S,ZHAO P.Relative Neighborhood and PruningStrategy Optimized Density Peaks Clustering Algorithm[J].ACTA Automatica Sinica,2020,46(3):562-575.
[17]YANG W,CAI L,WU F.Image segmentation based on gray level and local relative entropy two dimensional histogram[J].PLOS ONE,2020,15(3):1-9.
[18]GIONIS A,MANNILA H,TSAPARAS P.clustering aggregation[C]//Proceedings of ACM Transactions on Knowledge Discovery from Data.2007,1(1):1-30.
[19]LIMIN F,ENZO M.Flame,a novel fuzzy clustering method for the analysis of DNA microarray data[J].BMC Bioininforma-tics,2007,8(1):3-17.
[20]CHANG H,YEUNG D Y.Robust path-based spectral cluste-ring[J].Pattern Recognition,2008,41(1):191-203.
[21]LI C M.UCI machine learning repository [EB/OL].http://archive.ics.uci.edu/ml.
[22]VINH N X,EPPS J,BAILEY J.Information theoretic measures for clustering comparison:is a correction for chance necessary? [C]//Proceedings of ICML'09.Montreal,2009:1073-1080.
[1] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[2] 张亚迪, 孙悦, 刘锋, 朱二周.
结合密度参数与中心替换的改进K-means算法及新聚类有效性指标研究
Study on Density Parameter and Center-Replacement Combined K-means and New Clustering Validity Index
计算机科学, 2022, 49(1): 121-132. https://doi.org/10.11896/jsjkx.201100148
[3] 吴少波, 傅启明, 陈建平, 吴宏杰, 陆悠.
基于相对熵的元逆强化学习方法
Meta-inverse Reinforcement Learning Method Based on Relative Entropy
计算机科学, 2021, 48(9): 257-263. https://doi.org/10.11896/jsjkx.200700044
[4] 李杉, 许新征.
基于双角度并行剪枝的VGG16优化方法
Parallel Pruning from Two Aspects for VGG16 Optimization
计算机科学, 2021, 48(6): 227-233. https://doi.org/10.11896/jsjkx.200800016
[5] 汤鑫瑶, 张正军, 储杰, 严涛.
基于自然最近邻的密度峰值聚类算法
Density Peaks Clustering Algorithm Based on Natural Nearest Neighbor
计算机科学, 2021, 48(3): 151-157. https://doi.org/10.11896/jsjkx.200100112
[6] 邹承明, 陈德.
高维大数据分析的无监督异常检测方法
Unsupervised Anomaly Detection Method for High-dimensional Big Data Analysis
计算机科学, 2021, 48(2): 121-127. https://doi.org/10.11896/jsjkx.191100141
[7] 王茂光, 杨行.
一种基于AP-Entropy选择集成的风控模型和算法
Risk Control Model and Algorithm Based on AP-Entropy Selection Ensemble
计算机科学, 2021, 48(11A): 71-76. https://doi.org/10.11896/jsjkx.210200110
[8] 张煜, 陆亿红, 黄德才.
基于密度峰值的加权犹豫模糊聚类算法
Weighted Hesitant Fuzzy Clustering Based on Density Peaks
计算机科学, 2021, 48(1): 145-151. https://doi.org/10.11896/jsjkx.200400043
[9] 徐守坤, 倪楚涵, 吉晨晨, 李宁.
基于YOLOv3的施工场景安全帽佩戴的图像描述
Image Caption of Safety Helmets Wearing in Construction Scene Based on YOLOv3
计算机科学, 2020, 47(8): 233-240. https://doi.org/10.11896/jsjkx.190600109
[10] 田献珍, 孙立强, 田振中.
基于蚁群算法的图像重建
Image Reconstruction Based on Ant Colony Algorithm
计算机科学, 2020, 47(11A): 231-235. https://doi.org/10.11896/jsjkx.191000128
[11] 邓定胜.
一种改进的DBSCAN算法在Spark平台上的应用
Application of Improved DBSCAN Algorithm on Spark Platform
计算机科学, 2020, 47(11A): 425-429. https://doi.org/10.11896/jsjkx.190700071
[12] 张建新, 刘弘, 李焱.
一种面向人群疏散的高效分组方法
Efficient Grouping Method for Crowd Evacuation
计算机科学, 2019, 46(6): 231-238. https://doi.org/10.11896/j.issn.1002-137X.2019.06.035
[13] 胡闯, 杨庚, 白云璐.
面向差分隐私保护的聚类算法
Clustering Algorithm in Differential Privacy Preserving
计算机科学, 2019, 46(2): 120-126. https://doi.org/10.11896/j.issn.1002-137X.2019.02.019
[14] 陈子豪, 李强.
基于K-medoids的改进PBFT共识机制
Improved PBFT Consensus Mechanism Based on K-medoids
计算机科学, 2019, 46(12): 101-107. https://doi.org/10.11896/jsjkx.181002014
[15] 张天柱, 邹承明.
使用模糊聚类的胶囊网络在图像分类上的研究
Study on Image Classification of Capsule Network Using Fuzzy Clustering
计算机科学, 2019, 46(12): 279-285. https://doi.org/10.11896/jsjkx.190200315
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!