Computer Science, 2018, Vol. 45, Issue (7): 22-30. doi: 10.11896/j.issn.1002-137X.2018.07.004

• CCF Big Data 2017 •

Influence of Noisy Features on Internal Validation of Clustering

YANG Hu1, FU Yu2, FAN Dan1

  1. School of Information, Central University of Finance and Economics, Beijing 100081, China;
  2. School of Statistics, Renmin University of China, Beijing 100872, China
  • Received: 2017-06-25 Online: 2018-07-30 Published: 2018-07-30

Abstract: Internal validation measures are essential in cluster analysis: they are used to evaluate the quality of clustering results and to find the optimal number of clusters when the true structure of the sample is unknown. Although many studies have examined the performance of internal validation measures and found that some perform better than others, they ignore the influence of noisy features present in real data, which may mislead the selection and application of these measures. This study used 10 internal validation measures to determine the number of clusters in simulated and real datasets, in order to analyze the influence of noisy features on the choice of internal validation measure and on clustering results. The results indicate that noisy features in a dataset affect all of the internal validation indices examined except KL, CH and CCC, and that the accuracy of the clustering results decreases as noise increases.
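The procedure described in the abstract (scan candidate cluster numbers, score each partition with an internal index, keep the best) can be sketched as follows. This is a minimal illustration, not the authors' experimental setup: the synthetic data, the plain k-means with farthest-first initialisation, the uniform noise features, and every function name (`kmeans`, `calinski_harabasz`, `select_k`, `make_data`) are assumptions made for the example. The index used is CH (Calinski-Harabasz), one of the measures the abstract names.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, n_iter=100):
    """Lloyd's k-means with deterministic farthest-first initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(centroid(members) if members else centers[j])
        if updated == centers:  # assignments are stable, so stop
            break
        centers = updated
    return labels

def calinski_harabasz(points, labels):
    """CH index: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n, overall = len(points), centroid(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    between = within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))

def select_k(points, candidates=range(2, 7)):
    """Return the candidate k whose k-means partition maximises the CH index."""
    return max(candidates, key=lambda k: calinski_harabasz(points, kmeans(points, k)))

def make_data(n_noise, rng, per_cluster=30):
    """Three separated 2-D Gaussian clusters plus n_noise cluster-independent features."""
    data = []
    for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]:
        for _ in range(per_cluster):
            informative = [rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
            noise = [rng.uniform(-10.0, 10.0) for _ in range(n_noise)]
            data.append(informative + noise)
    return data

rng = random.Random(0)
for n_noise in (0, 2, 8):
    k = select_k(make_data(n_noise, rng))
    print(n_noise, "noisy features -> selected k =", k)
```

Swapping `calinski_harabasz` for another internal index in `select_k` gives a simple way to compare how different measures respond as `n_noise` grows; the true number of clusters in this synthetic data is 3.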

Key words: Clustering accuracy, Internal validation, Noisy features, Number of clusters

CLC Number: 

  • TP391