Unbalanced Data Classification Algorithm Based on Clustering Ensemble Under-sampling

Abstract

Abstract: Imbalanced data are common in real-world classification problems. Most traditional classification algorithms assume a balanced class distribution, so their predictions are biased toward the majority class and they perform poorly on the minority class. To address this, an improved AdaBoost classification algorithm based on clustering-ensemble under-sampling is proposed. The algorithm first applies a clustering ensemble to the data; then, according to the sample weights, it draws a certain proportion of the majority-class samples from each cluster and combines them with all minority-class samples to form a balanced training set. Within the AdaBoost framework, it applies different weight adjustments to misclassified majority-class and minority-class samples, and it selectively combines the few base classifiers that perform best into the final ensemble. Experimental results show that the algorithm has certain advantages in classifying imbalanced data.

Key words: Machine learning, Imbalanced data, Clustering ensemble, Under-sampling, Ensemble learning
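
Two steps named in the abstract, cluster-wise under-sampling by sample weight and class-dependent weight adjustment, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names `cluster_undersample` and `reweight`, the rule that every cluster contributes the same proportion of its majority samples, and the `minority_cost` factor. The cluster assignments are assumed to come from a clustering ensemble that was run beforehand, and the selective combination of base classifiers is omitted.

```python
import math
import random
from collections import defaultdict


def cluster_undersample(labels, clusters, weights, minority=1, seed=0):
    """Return sorted indices of a roughly balanced training set.

    Keeps every minority sample. From each cluster, draws majority samples
    without replacement, with probability proportional to their (positive)
    weights. The per-cluster proportion is an assumed rule: the same ratio
    (#minority / #majority) is applied to every cluster.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]

    # Group majority samples by their (precomputed) cluster assignment.
    by_cluster = defaultdict(list)
    for i, y in enumerate(labels):
        if y != minority:
            by_cluster[clusters[i]].append(i)

    n_major = sum(len(members) for members in by_cluster.values())
    ratio = min(1.0, len(minority_idx) / n_major) if n_major else 0.0

    chosen = []
    for members in by_cluster.values():
        # At least one majority sample is kept per cluster (assumption).
        k = max(1, round(ratio * len(members)))
        pool = list(members)
        for _ in range(min(k, len(pool))):
            pick = rng.choices(pool, weights=[weights[i] for i in pool])[0]
            pool.remove(pick)
            chosen.append(pick)
    return sorted(minority_idx + chosen)


def reweight(weights, labels, predictions, alpha, minority=1, minority_cost=2.0):
    """AdaBoost-style update with a larger increase for misclassified
    minority samples than for misclassified majority samples.

    The exponent scaling by `minority_cost` is a hypothetical choice,
    not the formula from the paper.
    """
    updated = []
    for w, y, p in zip(weights, labels, predictions):
        if y == p:
            factor = math.exp(-alpha)          # correct: weight shrinks
        elif y == minority:
            factor = math.exp(alpha * minority_cost)  # missed minority
        else:
            factor = math.exp(alpha)           # missed majority
        updated.append(w * factor)
    total = sum(updated)
    return [w / total for w in updated]        # renormalize to sum to 1


# Toy usage: 2 minority samples (label 1), 8 majority samples in 2 clusters.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
clusters = [0, 1, 0, 0, 0, 0, 1, 1, 1, 1]
weights = [0.1] * 10
balanced = cluster_undersample(labels, clusters, weights)
print(balanced)
```

In a full boosting loop, the indices returned by `cluster_undersample` would select the training set for one base classifier, and `reweight` would update the sample weights before the next round draws a new balanced set.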

ZHANG Xiao-shan and LUO Qiang. Unbalanced Data Classification Algorithm Based on Clustering Ensemble Under-sampling[J]. Computer Science, 2015, 42(Z11): 63-66. https://doi.org/

