计算机科学 ›› 2020, Vol. 47 ›› Issue (3): 79-86.doi: 10.11896/jsjkx.190400123

• 数据库&大数据&数据科学 • 上一篇    下一篇

复杂高维数据的密度峰值快速搜索聚类算法

陈俊芬,张明,赵佳成   

  1. (河北大学数学与信息科学学院河北省机器学习与计算智能重点实验室 河北 保定071002)
  • 收稿日期:2019-04-22 出版日期:2020-03-15 发布日期:2020-03-30
  • 通讯作者: 陈俊芬(chenjunfen2010@126.com)
  • 基金资助:
    河北省自然科学基金(F2016201161);高层次创新人才科研启动经费项目

Clustering Algorithm by Fast Search and Find of Density Peaks for Complex High-dimensional Data

CHEN Jun-fen,ZHANG Ming,ZHAO Jia-cheng   

  1. (Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Sciences, Hebei University, Baoding, Hebei 071002, China)
  • Received:2019-04-22 Online:2020-03-15 Published:2020-03-30
  • About author:CHEN Jun-fen,born in 1976.Ph.D,associate professor,master supervisor,is member of China Computer Federation.Her main research interests include data mining,machine learning and image processing.
  • Supported by:
    This work was supported by the Natural Science Foundation of Hebei Province, China (F2016201161) and Research Foundation for Advanced Scholars Program of Hebei University, China.

摘要: 机器学习的无监督聚类算法已被广泛应用于各种目标识别任务。基于密度峰值的快速搜索聚类算法(DPC)能快速有效地确定聚类中心点和类个数,但在处理复杂分布形状的数据和高维图像数据时仍存在聚类中心点不容易确定、类数偏少等问题。为了提高其处理复杂高维数据的鲁棒性,文中提出了一种基于学习特征表示的密度峰值快速搜索聚类算法(AE-MDPC)。该算法采用无监督的自动编码器(AutoEncoder)学出数据的最优特征表示,结合能刻画数据全局一致性的流形相似性,提高了同类数据间的紧致性和不同类数据间的分离性,促使潜在类中心点的密度值成为局部最大。在4个人工数据集和4个真实图像数据集上将AE-MDPC与经典的K-means,DBSCAN,DPC算法以及结合了PCA的DPC算法进行比较。实验结果表明,在外部评价指标聚类精度、内部评价指标调整互信息和调整兰德指数上,AE-MDPC的聚类性能优于对比算法,而且提供了更好的可视化性能。总之,基于特征表示学习且结合流形距离的AE-MDPC算法能有效地处理复杂流形数据和高维图像数据。

关键词: DPC算法, 聚类, 流形距离, 密度峰值, 特征表示

Abstract: Unsupervised clustering in machine learning is widely applied in various object recognition tasks.A novel clustering algorithm based on density peaks (DPC) can find out cluster center points quickly in decision graph and the number of clusters.However,when dealing with the data of complex distribution shape and high-dimensional image data,there are still some problems in DPC algorithm,such as difficult to determine the cluster center points and few clusters.In order to improve its robustness in dealing with complex high-dimensional data,an improved DPC clustering algorithm (AE-MDPC) was presented,which employs an autoencoder,a kind of unsupervised learning method,to obtain the optimal feature representation from input data,and manifold similarity of pairwise data to describe the global consistence.The autoencoder can reduce feature noises via reducing dimension of the high-dimensional image data,whilst manifold distance can lead to the densities of the potential cluster centers become global peaks.AE-MDPC algorithm was compared with K-means,DBSCAN,DPC and DPC combined PCA on four artificial datasets and four real face image datasets.The experimental results demonstrate that AE-MDPC outperforms the other clustering algorithms on clustering accuracy,adjusted mutual information and adjusted rand index,meanwhile AE-MDPC provides better clustering visualization.Overall,the proposed AE-MDPC algorithm can effectively handle complex manifold data and high-dimensional image data.

Key words: Clustering, Density peaks, DPC algorithm, Features representation, Manifold distance

中图分类号: 

  • TP181
[1]QUEEN J M.Some methods for classification and analysis of multivariate observations[C]∥Proc of the fifth Berkeley symposium on mathematical statistics and probability.Oakland:Lucien Marie Le Cam,1967:281-297.
[2]ISMKHAN H.I-k-means-+:An Iterative Clustering Algorithm Based on an Enhanced Version of the k-means[J].Pattern Re-cognition,2018,79:402-413.
[3]JIA R Y,LI Y G.K-means algorithm for self-determination of cluster number and initial center[J].Computer Engineering and Application,2018,54 (7):152-158.
[4]BIRANT D,KUT A.ST-DBSCAN:An algorithm for clustering spatial-temporal data[J].Data & Knowledge Engineering,2007,60(1):208-221.
[5]HOU J,GAO H J,LI X L.DSets-DBSCAN:A Parameter-Free Clustering Algorithm[J].IEEE Transactions on Image Proces-sing,2016,25(7):3182-3193.
[6]TRAN T N,DRAB K,DASZYKOWSKI M.Revised DBSCAN algorithm to cluster data with dense adjacent clusters[J].Chemometrics & Intelligent Laboratory Systems,2013,120(2):92-96.
[7]RODRIGUEZ A,LAIO A.Clustering by fast search and find of density peaks[J].Science,2014,344(6191):1492-1496.
[8]XIE J Y,GAO H C,XIE W X.K Nearest Neighbor Optimized Density Peak Fast Search Clustering Algorithms[J].Chinese Science:Information Science,2016,46 (2):258-280.
[9]MEHMOOD R,BIE R,DAWOOD H,et al.Fuzzy Clustering by Fast Search and Find of Density Peaks[J].Personal & Ubiquitous Computing,2016,20(5):785-793.
[10]LI C Y,DING G Y,WANG D K.Clustering by Fast Search and Find of Density Peaks with Data Field[J].Chinese Journal of Electronics,2016,25(3):397-402.
[11]XU J,WANG G Y,DENG W H.DenPEHC:Density Peak based Efficient Hierarchical Clustering[J].Information Sciences,2016,373(12):200-218.
[12]LU Y H,XIA C.Optimal K-Nearest Neighbor and Local Density Clustering Algorithms for Uncertain Data[J].Control and Decision-making,2016,31(3):541-546.
[13]WANG P F,YANG Y W,KE Y Q.Research on Optimization of fast clustering algorithm for peak density[J].Computer Engineering and Science,2018,40(8):1503-1510.
[14]XIE J Y,QU Y N.K-medoids clustering algorithm for initial center of peak density optimization [J].Computer Science and Exploration,2016,10(2):230-247.
[15]WANG S L,WANG D K,LI C Y,et al.Clustering by fast search and find of density peaks with data field[J].Chinese Journal of Electronics,2016,25(3):397-402.
[16]XIE J Y,GAO H C,XIE W X,et al.Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J].Information Sciences,2016,354(C):19-40.
[17]DU M J,DING S F,JIA H J.Study on density peaks clustering based on k-nearest neighbors and principal component analysis[J].Knowledge-Based Systems,2016,99:135-145.
[18]LIU R,WANG H,YU X M.Shared-nearest-neighbor-based Clustering by Fast Search and Find of Density Peaks[J].Information Sciences,2018,450:200-226.
[19]GOTTUMUKKAL R.An improved face recognition technique based on modular PCA approach[J].Pattern Recognit Lett,2004,25(4):429-436.
[20]KE Y,SUKTHANKAR R.PCA-SIFT:a more distinctive representation for local image descriptors[C]∥Proc of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington:IEEE,2004.
[21]ZHANG J Q,ZHANG H Y.Fast search clustering algorithm for density peak based on manifold distance[J].ComputerKnow-ledge and Technology,2017,13(2):179-182.
[22]YANG H,FU Y,FAN D.The effect of noise characteristics on the internal validity of clustering[J].Computer Science,2018,45(7):22-30.
[23]TENENBAUM J B,SILVA V D E,LANGFORD J C.A global geometric framework for nonlinear dimensionality reduction[J].Science,2000,290(5500):2319-2323.
[24]RUMELHART D E,HINTON G E,WILLIAMS R J.Learning representations by back-propagating errors[J].Nature,1986,323(6088):533-536.
[25]YOSHUA B,LAMBLIN P,POPOVICI D,et al.Greedy layer- wise training of deep networks[C]∥Advances in Neural Information Processing Systems (NIPS’06).2006:153-160.
[26]VINH N X,EPPS J,BAILEY J.Bibliometrics:Information theo- retic measures for clusterings comparison[C]∥Proc of the International Conference on Machine Learning.New York:ACM,2010,2837-2854.
[27]YANN L C,L′EON B,YOSHUA B,et al.Gradient-based lear- ning applied to document recognition[J].Proceedings of the IEEE,1998,86(11):2278-2324.
[1] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于分层抽样优化的面向异构客户端的联邦学习
Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients
计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263
[2] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[3] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于DBSCAN聚类的集群联邦学习方法
Clustered Federated Learning Methods Based on DBSCAN Clustering
计算机科学, 2022, 49(6A): 232-237. https://doi.org/10.11896/jsjkx.211100059
[4] 郁舒昊, 周辉, 叶春杨, 王太正.
SDFA:基于多特征融合的船舶轨迹聚类方法研究
SDFA:Study on Ship Trajectory Clustering Method Based on Multi-feature Fusion
计算机科学, 2022, 49(6A): 256-260. https://doi.org/10.11896/jsjkx.211100253
[5] 毛森林, 夏镇, 耿新宇, 陈剑辉, 蒋宏霞.
基于密度敏感距离和模糊划分的改进FCM算法
FCM Algorithm Based on Density Sensitive Distance and Fuzzy Partition
计算机科学, 2022, 49(6A): 285-290. https://doi.org/10.11896/jsjkx.210700042
[6] 陈景年.
一种适于多分类问题的支持向量机加速方法
Acceleration of SVM for Multi-class Classification
计算机科学, 2022, 49(6A): 297-300. https://doi.org/10.11896/jsjkx.210400149
[7] 刘丽, 李仁发.
医疗CPS协作网络控制策略优化
Control Strategy Optimization of Medical CPS Cooperative Network
计算机科学, 2022, 49(6A): 39-43. https://doi.org/10.11896/jsjkx.210300230
[8] 陈佳舟, 赵熠波, 徐阳辉, 马骥, 金灵枫, 秦绪佳.
三维城市场景中的小物体检测
Small Object Detection in 3D Urban Scenes
计算机科学, 2022, 49(6): 238-244. https://doi.org/10.11896/jsjkx.210400174
[9] 邢云冰, 龙广玉, 胡春雨, 忽丽莎.
基于SVM的类别增量人体活动识别方法
Human Activity Recognition Method Based on Class Increment SVM
计算机科学, 2022, 49(5): 78-83. https://doi.org/10.11896/jsjkx.210400024
[10] 朱哲清, 耿海军, 钱宇华.
面向化学结构的线段聚类算法
Line-Segment Clustering Algorithm for Chemical Structure
计算机科学, 2022, 49(5): 113-119. https://doi.org/10.11896/jsjkx.210700131
[11] 张宇姣, 黄锐, 张福泉, 隋栋, 张虎.
基于菌群优化的近邻传播聚类算法研究
Study on Affinity Propagation Clustering Algorithm Based on Bacterial Flora Optimization
计算机科学, 2022, 49(5): 165-169. https://doi.org/10.11896/jsjkx.210800218
[12] 左园林, 龚月姣, 陈伟能.
成本受限条件下的社交网络影响最大化方法
Budget-aware Influence Maximization in Social Networks
计算机科学, 2022, 49(4): 100-109. https://doi.org/10.11896/jsjkx.210300228
[13] 韩洁, 陈俊芬, 李艳, 湛泽聪.
基于自注意力的自监督深度聚类算法
Self-supervised Deep Clustering Algorithm Based on Self-attention
计算机科学, 2022, 49(3): 134-143. https://doi.org/10.11896/jsjkx.210100001
[14] 杨旭华, 王磊, 叶蕾, 张端, 周艳波, 龙海霞.
基于节点相似性和网络嵌入的复杂网络社区发现算法
Complex Network Community Detection Algorithm Based on Node Similarity and Network Embedding
计算机科学, 2022, 49(3): 121-128. https://doi.org/10.11896/jsjkx.210200009
[15] 蒲实, 赵卫东.
一种面向动态科研网络的社区检测算法
Community Detection Algorithm for Dynamic Academic Network
计算机科学, 2022, 49(1): 89-94. https://doi.org/10.11896/jsjkx.210100023
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!