Unbalanced Data Classification Algorithm Based on Clustering Ensemble Under-sampling

Abstract

Abstract: Imbalanced data is common in real-world applications. Most traditional classification algorithms assume a balanced class distribution, so on imbalanced data their predictions are biased toward the majority class and the results are unsatisfactory. This paper proposes an enhanced AdaBoost algorithm based on clustering-ensemble under-sampling. The algorithm first clusters the sample data with a clustering ensemble, taking the sample weights into account. It then randomly selects majority-class samples from each cluster in a certain proportion and merges them with all minority-class samples to form a balanced training set. Within the AdaBoost framework, the algorithm adjusts the weights of majority-class and minority-class samples differently, and it selects several of the better-performing base classifiers to build the final ensemble. Experimental results show that the algorithm has a certain advantage in classifying imbalanced data.
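The under-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the clustering ensemble has already assigned a cluster label to every majority-class sample, and it assumes a quota rule (each cluster contributes in proportion to its size, with at least one sample) because the abstract does not state the exact proportion. The function name `cluster_undersample` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def cluster_undersample(majority, cluster_ids, minority, seed=0):
    """Build a roughly balanced training set from clustered majority samples.

    majority    -- list of majority-class samples
    cluster_ids -- cluster label for each majority sample (assumed given,
                   e.g. produced by a clustering ensemble)
    minority    -- list of minority-class samples (all are kept)
    """
    rng = random.Random(seed)

    # Group majority samples by their cluster label.
    clusters = defaultdict(list)
    for sample, cid in zip(majority, cluster_ids):
        clusters[cid].append(sample)

    # Aim for about as many majority samples as there are minority samples.
    target = min(len(minority), len(majority))

    selected = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Assumed quota rule: proportional to cluster size, at least one.
        quota = max(1, round(target * len(members) / len(majority)))
        selected.extend(rng.sample(members, min(quota, len(members))))

    # Label 0 marks the majority class, label 1 the minority class.
    return [(x, 0) for x in selected] + [(x, 1) for x in minority]
```

Sampling inside each cluster, rather than from the majority class as a whole, keeps every region of the majority distribution represented in the reduced training set. In a boosting loop, a step like this would be repeated each round before the base classifier is trained.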

Key words: Machine learning, Imbalanced data, Clustering ensemble, Under-sampling, Ensemble learning

ZHANG Xiao-shan and LUO Qiang. Unbalanced Data Classification Algorithm Based on Clustering Ensemble Under-sampling[J]. Computer Science, 2015, 42(Z11): 63-66.
