计算机科学 ›› 2016, Vol. 43 ›› Issue (7): 245-250.doi: 10.11896/j.issn.1002-137X.2016.07.044

• 人工智能 • 上一篇    下一篇

一种面向非平衡数据的多簇IB算法

江鹏,叶阳东,娄铮铮   

  1. 郑州大学信息工程学院 郑州450052,郑州大学信息工程学院 郑州450052,郑州大学信息工程学院 郑州450052
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受国家自然科学基金项目:多变量IB方法及算法的研究(61170223),国家自然科学基金联合基金项目:可扩展迁移学习中跨媒体复杂问题自动映射研究(U1204610)资助

Multi-clusters IB Algorithm for Imbalanced Data Set

JIANG Peng, YE Yang-dong and LOU Zheng-zheng   

  • Online:2018-12-01 Published:2018-12-01

摘要: 信息瓶颈(Information Bottleneck,IB)方法在处理非平衡数据集时,倾向于将大簇中的数据对象划分到数据规模较小的小簇中,造成了聚类效果不理想的问题。针对该问题,提出了一种面向非平衡数据的多簇信息瓶颈算法(McIB)。McIB算法采用向下抽样方法来降低非平衡数据集的倾斜度,使用先划分再学习后合并的策略来优化IB算法处理非平衡数据的合并抽取过程。整个算法包含3步:首先根据分离标准来确定抽样比例参数;然后对数据进行初步的聚类,生成可信赖的多个簇;最后再利用簇之间的相似性对簇进行合并,组织多个簇代表每个实际的簇来得到最终的聚类结果。实验结果表明:所提算法能够有效地解决IB方法在非平衡数据集上的“均匀效应”问题;与其他聚类算法相比,McIB算法的性能更优。

关键词: 聚类,IB算法,非平衡数据,多簇,簇合并

Abstract: When dealing with imbalanced data sets,the original IB method tends to produce clusters of relatively uniform size,resulting in the problem of unsatisfactory clustering effect.To solve this problem,this paper proposesd a multi-clusters information bottleneck (McIB) algorithm.McIB algorithm tries to reduce the skewness of the data distributions by under-sampling method to divide the imbalanced data sets into multiple relatively uniform size clusters.Entire algorithm consists of three steps.First,a dividing measurement standard is proposed to determine the sampling ratio parameter.Second,McIB algorithm preliminary analyses the data to generate reliable multi-clusters.At last,McIB algorithm merges clusters into one bigger size cluster according to the similarity between clusters and organizes multiple clusters representing the actual cluster to obtain the final clustering results.Experimental results show that the McIB algorithm can effectively mine the pattern resided in imbalanced data sets.Compared with other common clustering algorithms,the performance of the McIB algorithm is better.

Key words: Clustering,Information bottleneck method,Imbalanced data,Multi-clusters,Cluster merging

[1] He H,Garcia E A.Learning from imbalanced data[J].IEEETransactions on Knowledge and Data Engineering,2009,21(9):1263-1284
[2] Longadge R,Dongre S.Class Imbalance Problem in Data Mi-ning:Review[J].International Journal of Computer Science and Network,2013,2(1):83-87
[3] Chawla Nitesh V.Data mining for imbalanced datasets:An over-view[M]∥Data Mining and Knowledge Discovery Handbook,2005.US:Springer,2005:853-867
[4] Provost F.Machine learning from imbalanced data sets 101[C]∥Proceedings of the AAAI’2000 Workshop on Imbalanced Data Sets,2000.2000:1-3
[5] Zhi Wei-mei,Guo Hua-ping,Fan Ming,et al.Disscussion onclassfication for imbalanced Data sets[J].Computer Science,2012,39(S1):304-308(in Chinese) 职为梅,郭华平,范明,等.非平衡数据集分类方法探讨[J].计算机科学,2012,39(S1):304-308
[6] Kumar C N S,Rao K N,Govardhan A,et al.Imbalanced K-Means:An algorithm to cluster imbalanced-distributed data[J].International Journal of Engineering and Technical Research,2014,2(2):114-122
[7] Jain A K,Dubes R C.Algorithms for clustering data[M].Englewood Cliffs:Prentice hall,1988
[8] Nguyen C H,Ho T B.An imbalanced data rule learner[C]∥Knowledge Discovery in Databases:PKDD 2005.Berlin:Sprin-ger,2005:617-624
[9] MacQueen J.Some methods for classification and analysis of mul-tivariate observations[C]∥Proceedings of Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Pro-bability,1967.Oakland:CA,1967:281-297
[10] Abolkarlou N A,Niknafs A A,Ebrahimpour M K.Ensembleimbalance classification:Using data preprocessing,clustering algorithm and genetic algorithm[C]∥2014 4th International Conference on Computer and Knowledge Engineering (ICCKE).IEEE,2014:171-176
[11] Yen S-J,Lee Y-S.Cluster-based under-sampling approaches for imbalanced data distributions[J].Expert Systems with Applications,2009,36(3):5718-5727
[12] Zhang Y P,Zhang L N,Wang Y C.Cluster-based majority under-sampling approaches for class imbalance learning[C]∥2010 2nd IEEE International Conference on Information and Financial Engineering (ICIFE).IEEE,2010:400-404
[13] Xiong H,Wu J,Chen J.K-means clustering versus validationmeasures:a data-distribution perspective[J].IEEE Transactions on Systems,Man,and Cybernetics,Part B:Cybernetics,2009,39(2):318-331
[14] Liang J,Bai L,Dang C,et al.The-Means-Type Algorithms Versus Imbalanced Data Distributions[J].IEEE Transactions on Fuzzy Systems,2012,20(4):728-745
[15] Qian J,Saligrama V.Spectral clustering with imbalanced data[C]∥2014 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2014:3057-3061
[16] Prachuabsupakij W,Soonthornphisaj N.Cluster-based sampling of multiclass imbalanced data[J].Intelligent Data Analysis,2014,18(6):1109-1135
[17] Tishby N,Pereira F C,Bialek W.The information bottleneck method[C]∥Proceedings of 37th Allerton Conference on Communication,Control and Computering,1999.1999:368-377
[18] Cover T M,Thomas J A.Elements of information theory[M].New York:John Wiley & Sons,2012
[19] Slonim N.The information bottleneck:Theory and applications[D].Jerusalem:The Hebrew University of Jerusalem,2002
[20] Slonim N,Tishby N.Document clustering using word clustersvia the information bottleneck method[C]∥Proceedings of Proceedings of the 23rd Annual International ACM SIGIR Confe-rence on Research and Development in Information Retrieval,2000.ACM,2000:208-215
[21] Lou Zheng-zheng,Yang Chen,Ye Yang-dong.An IB algorithm based on data selection model[J].Acta Electronica Sinica,2014,42(9):1839-1846(in Chinese) 娄铮铮,杨晨,叶阳东.基于数据选择模型的 IB 算法[J].电子学报,2014,42(9):1839-1846
[22] DeGroot M H,Schervish M J,Fang X,et al.Probability and statistics[M].MA:Addison-Wesley Reading,1986
[23] Alcala-Fdez J,Sanchez L,Garcia S,et al.KEEL:a software tool to assess evolutionary algorithms for data mining problems[J].Soft Computing,2009,13(3):307-318
[24] Shi J,Malik J.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,2(8):888-905
[25] Yang Y.An evaluation of statistical approaches to text categorization[J].Information Retrieval,1999,1(1/2):69-90
[26] Manning C D,Raghavan P,Schütze H.Introduction to information retrieval[M].Cambridge:Cambridge University Press,2008

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!