Computer Science (计算机科学), 2018, Vol. 45, Issue 7: 22-30. doi: 10.11896/j.issn.1002-137X.2018.07.004
YANG Hu1, FU Yu2, FAN Dan1 (杨虎1, 付宇2, 范丹1)
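Abstract: Internal cluster validity indices evaluate the quality of clustering results and help find the optimal number of clusters when the true class labels of the samples are unknown; they are an important topic in cluster analysis research. Many studies have analyzed the performance of internal validity indices, and some conclude that certain indices perform well and can help a clustering algorithm find the optimal number of clusters. However, these studies do not consider how noise features in real data affect internal validity indices, so their conclusions may mislead the selection and application of these indices. This paper therefore selects 10 commonly used internal validity indices and studies how noise features affect the selection of internal validity indices and the clustering results. The results show that noise features in the data affect the performance of internal validity indices: except for the KL, CH, and CCC indices, which are relatively insensitive to noise features, all other internal validity indices are sensitive to them, and the accuracy of the clustering results decreases as the noise becomes stronger.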
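As a rough illustration of the idea in the abstract, the following sketch uses one internal validity index, the CH (Calinski-Harabasz) index, to pick the number of clusters, first on clean data and then after appending pure-noise features. It is not the paper's experimental setup: the synthetic data, the simple k-means routine, the noise generator, and all function names (`kmeans`, `ch_index`, `best_k`, `add_noise_features`) are assumptions made for this example. The index itself is the standard ratio CH(k) = [B / (k - 1)] / [W / (n - k)], where B and W are the between-cluster and within-cluster sums of squares.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centre if a cluster empties out
                centers[j] = mean(members)
    return labels

def ch_index(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    overall = mean(points)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    between = within = 0.0
    for members in clusters.values():
        c = mean(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))

def best_k(points, k_range=range(2, 7)):
    """Return the k with the largest CH index, plus all scores."""
    scores = {k: ch_index(points, kmeans(points, k)) for k in k_range}
    return max(scores, key=scores.get), scores

def make_clean_data(rng, per_cluster=30):
    # Hypothetical toy data: three well-separated 2-D Gaussian clusters.
    centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
            for cx, cy in centres for _ in range(per_cluster)]

def add_noise_features(points, m, rng, scale=10.0):
    # Append m uniform columns that carry no cluster information.
    return [p + [rng.uniform(0.0, scale) for _ in range(m)] for p in points]

rng = random.Random(0)
clean = make_clean_data(rng)
noisy = add_noise_features(clean, 5, rng)
k_clean, _ = best_k(clean)
k_noisy, _ = best_k(noisy)
print("k chosen by CH on clean data:", k_clean)
print("k chosen by CH with 5 noise features:", k_noisy)
```

On the clean toy data the CH index should recover the three generated clusters. What happens once noise features are added depends on the number and scale of those features, which is the kind of sensitivity the paper examines across its 10 indices.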