Computer Science ›› 2016, Vol. 43 ›› Issue (7): 245-250.doi: 10.11896/j.issn.1002-137X.2016.07.044

Previous Articles     Next Articles

Multi-clusters IB Algorithm for Imbalanced Data Set

JIANG Peng, YE Yang-dong and LOU Zheng-zheng   

  • Online:2018-12-01 Published:2018-12-01

Abstract: When dealing with imbalanced data sets,the original IB method tends to produce clusters of relatively uniform size,resulting in the problem of unsatisfactory clustering effect.To solve this problem,this paper proposesd a multi-clusters information bottleneck (McIB) algorithm.McIB algorithm tries to reduce the skewness of the data distributions by under-sampling method to divide the imbalanced data sets into multiple relatively uniform size clusters.Entire algorithm consists of three steps.First,a dividing measurement standard is proposed to determine the sampling ratio parameter.Second,McIB algorithm preliminary analyses the data to generate reliable multi-clusters.At last,McIB algorithm merges clusters into one bigger size cluster according to the similarity between clusters and organizes multiple clusters representing the actual cluster to obtain the final clustering results.Experimental results show that the McIB algorithm can effectively mine the pattern resided in imbalanced data sets.Compared with other common clustering algorithms,the performance of the McIB algorithm is better.

Key words: Clustering,Information bottleneck method,Imbalanced data,Multi-clusters,Cluster merging

[1] He H,Garcia E A.Learning from imbalanced data[J].IEEETransactions on Knowledge and Data Engineering,2009,21(9):1263-1284
[2] Longadge R,Dongre S.Class Imbalance Problem in Data Mi-ning:Review[J].International Journal of Computer Science and Network,2013,2(1):83-87
[3] Chawla Nitesh V.Data mining for imbalanced datasets:An over-view[M]∥Data Mining and Knowledge Discovery Handbook,2005.US:Springer,2005:853-867
[4] Provost F.Machine learning from imbalanced data sets 101[C]∥Proceedings of the AAAI’2000 Workshop on Imbalanced Data Sets,2000.2000:1-3
[5] Zhi Wei-mei,Guo Hua-ping,Fan Ming,et al.Disscussion onclassfication for imbalanced Data sets[J].Computer Science,2012,39(S1):304-308(in Chinese) 职为梅,郭华平,范明,等.非平衡数据集分类方法探讨[J].计算机科学,2012,39(S1):304-308
[6] Kumar C N S,Rao K N,Govardhan A,et al.Imbalanced K-Means:An algorithm to cluster imbalanced-distributed data[J].International Journal of Engineering and Technical Research,2014,2(2):114-122
[7] Jain A K,Dubes R C.Algorithms for clustering data[M].Englewood Cliffs:Prentice hall,1988
[8] Nguyen C H,Ho T B.An imbalanced data rule learner[C]∥Knowledge Discovery in Databases:PKDD 2005.Berlin:Sprin-ger,2005:617-624
[9] MacQueen J.Some methods for classification and analysis of mul-tivariate observations[C]∥Proceedings of Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Pro-bability,1967.Oakland:CA,1967:281-297
[10] Abolkarlou N A,Niknafs A A,Ebrahimpour M K.Ensembleimbalance classification:Using data preprocessing,clustering algorithm and genetic algorithm[C]∥2014 4th International Conference on Computer and Knowledge Engineering (ICCKE).IEEE,2014:171-176
[11] Yen S-J,Lee Y-S.Cluster-based under-sampling approaches for imbalanced data distributions[J].Expert Systems with Applications,2009,36(3):5718-5727
[12] Zhang Y P,Zhang L N,Wang Y C.Cluster-based majority under-sampling approaches for class imbalance learning[C]∥2010 2nd IEEE International Conference on Information and Financial Engineering (ICIFE).IEEE,2010:400-404
[13] Xiong H,Wu J,Chen J.K-means clustering versus validationmeasures:a data-distribution perspective[J].IEEE Transactions on Systems,Man,and Cybernetics,Part B:Cybernetics,2009,39(2):318-331
[14] Liang J,Bai L,Dang C,et al.The-Means-Type Algorithms Versus Imbalanced Data Distributions[J].IEEE Transactions on Fuzzy Systems,2012,20(4):728-745
[15] Qian J,Saligrama V.Spectral clustering with imbalanced data[C]∥2014 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2014:3057-3061
[16] Prachuabsupakij W,Soonthornphisaj N.Cluster-based sampling of multiclass imbalanced data[J].Intelligent Data Analysis,2014,18(6):1109-1135
[17] Tishby N,Pereira F C,Bialek W.The information bottleneck method[C]∥Proceedings of 37th Allerton Conference on Communication,Control and Computering,1999.1999:368-377
[18] Cover T M,Thomas J A.Elements of information theory[M].New York:John Wiley & Sons,2012
[19] Slonim N.The information bottleneck:Theory and applications[D].Jerusalem:The Hebrew University of Jerusalem,2002
[20] Slonim N,Tishby N.Document clustering using word clustersvia the information bottleneck method[C]∥Proceedings of Proceedings of the 23rd Annual International ACM SIGIR Confe-rence on Research and Development in Information Retrieval,2000.ACM,2000:208-215
[21] Lou Zheng-zheng,Yang Chen,Ye Yang-dong.An IB algorithm based on data selection model[J].Acta Electronica Sinica,2014,42(9):1839-1846(in Chinese) 娄铮铮,杨晨,叶阳东.基于数据选择模型的 IB 算法[J].电子学报,2014,42(9):1839-1846
[22] DeGroot M H,Schervish M J,Fang X,et al.Probability and statistics[M].MA:Addison-Wesley Reading,1986
[23] Alcala-Fdez J,Sanchez L,Garcia S,et al.KEEL:a software tool to assess evolutionary algorithms for data mining problems[J].Soft Computing,2009,13(3):307-318
[24] Shi J,Malik J.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,2(8):888-905
[25] Yang Y.An evaluation of statistical approaches to text categorization[J].Information Retrieval,1999,1(1/2):69-90
[26] Manning C D,Raghavan P,Schütze H.Introduction to information retrieval[M].Cambridge:Cambridge University Press,2008

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!