Computer Science ›› 2018, Vol. 45 ›› Issue (7): 22-30. doi: 10.11896/j.issn.1002-137X.2018.07.004

• The 5th CCF Conference on Big Data •

Influence of Noisy Features on Internal Validation of Clustering

YANG Hu1, FU Yu2, FAN Dan1

  1. School of Information, Central University of Finance and Economics, Beijing 100081, China;
  2. School of Statistics, Renmin University of China, Beijing 100872, China
  • Received: 2017-06-25 Online: 2018-07-30 Published: 2018-07-30
  • About the authors: YANG Hu (1983-), male, Ph.D., associate professor. His main research interests are big data analysis and statistical computing, and clustering algorithms. E-mail: hu.yang@cufe.edu.cn (corresponding author). FU Yu (1995-), male, master's student. His main research interest is cluster analysis. E-mail: 386340673@qq.com. FAN Dan (1982-), female, Ph.D., lecturer. Her main research interests are adaptive modeling and time series analysis. E-mail: fandan@cufe.edu.cn.
  • Supported by:
    The Young Scientists Fund of the National Natural Science Foundation of China (71701223).

Abstract: Internal validation measures are an essential part of cluster analysis: they evaluate the quality of clustering results and help find the optimal number of clusters when the true class labels of the samples are unknown. Many studies have analyzed the performance of internal validation measures, and some conclude that certain measures perform well and can help clustering algorithms find the optimal number of clusters. However, these studies do not consider how noisy features in real data affect the measures, so their conclusions may mislead the selection and application of internal validation measures. This study therefore selected 10 commonly used internal validation measures to determine the number of clusters on simulated and real datasets, and analyzed the influence of noisy features on the selection of internal validation measures and on clustering results. The results indicate that noisy features affect the performance of internal validation measures: the KL, CH and CCC indices are relatively insensitive to noisy features, whereas the other measures are sensitive to them, and the accuracy of clustering results decreases as the noise increases.

Key words: Clustering accuracy, Internal validation, Noisy features, Number of clusters
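
As a rough illustration of how a noisy feature enters an internal validation measure, the sketch below computes the Calinski-Harabasz (CH) index, CH = [B / (k - 1)] / [W / (n - k)], where B and W are the between-cluster and within-cluster sums of squares, for a fixed partition with and without an extra feature that is unrelated to the cluster structure. The toy data, the noise values and the helper names (`centroid`, `sq_dist`, `ch_index`) are assumptions made for this example and are not taken from the paper. The sketch only shows that the index value drops for a fixed partition. It does not test whether the selected number of clusters changes, which is the question the study evaluates.

```python
# Minimal sketch of the Calinski-Harabasz (CH) index in pure Python.
# All data and helper names are illustrative assumptions, not from the paper.

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ch_index(points, labels):
    """CH = [B / (k - 1)] / [W / (n - k)] for a given partition."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    overall = centroid(points)
    between = 0.0  # B: between-cluster sum of squares
    within = 0.0   # W: within-cluster sum of squares
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated clusters described by one informative feature.
clean = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical noisy feature: large spread, identical mean in both clusters,
# so it adds to W but contributes nothing to B.
noise = [-5.0, 0.0, 5.0, -5.0, 0.0, 5.0]
noisy = [p + [z] for p, z in zip(clean, noise)]

ch_clean = ch_index(clean, labels)
ch_noisy = ch_index(noisy, labels)
print(ch_clean, ch_noisy)
```

In this toy case the noisy feature leaves B unchanged and inflates W, so the CH value for the same partition falls sharply even though the underlying cluster structure is identical.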

CLC Number: 

  • TP391