Two-tier Clustering for Mining Imbalanced Datasets

Abstract

Classification of class-imbalanced data has become a hot research topic in machine learning and data mining. Most classification algorithms tend to assign most incoming data to the majority class, which results in poor classification performance on minority class instances, although these are usually of much greater interest. This paper proposes a two-tier clustering cascade mining algorithm. The algorithm first constructs a balanced training set by cluster-based under-sampling: K-means clustering is applied to the majority class, and the resulting cluster centroids are merged with all minority class instances to form a balanced training set. When the minority class is so small that training instances would be in short supply, SMOTE over-sampling is combined with the cluster-based under-sampling. Next, "K-means+C4.5", a method that cascades K-means clustering and the C4.5 decision tree algorithm, is used for classification on the balanced training set: K-means first partitions the training instances into k clusters, and C4.5 then builds a decision tree on each cluster, so that each tree refines the decision boundaries by learning the subgroups within its cluster. Experimental results show that the proposed method provides better classification performance than other approaches on both the minority and majority classes, and that it is effective and feasible for handling imbalanced datasets.
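
The first stage described above (cluster-based under-sampling, with over-sampling when the minority class is very small) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the `min_minority` threshold, the choice of one cluster per minority instance, and the simplified single-nearest-neighbour interpolation standing in for full SMOTE are all assumptions introduced here. The K-means+C4.5 cascade of the second stage is not covered.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans_centroids(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns exactly k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def interpolate_minority(minority, n_new, seed=0):
    """Simplified SMOTE-style step (assumption: single nearest neighbour).

    Each new point lies on the segment between a random minority instance
    and its nearest minority neighbour. Needs at least two instances.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = min((j for j in range(len(minority)) if j != i),
                key=lambda j: sq_dist(minority[i], minority[j]))
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def balanced_training_set(majority, minority, min_minority=0):
    """Return (instance, label) pairs; label 0 = majority, 1 = minority."""
    if len(minority) < min_minority:
        # Hypothetical threshold: top up the minority class when too small.
        minority = minority + interpolate_minority(
            minority, min_minority - len(minority))
    # Assumption: one cluster per minority instance, so both classes match.
    centroids = kmeans_centroids(majority, k=len(minority))
    return [(c, 0) for c in centroids] + [(m, 1) for m in minority]


# Toy data: 40 majority points around the origin, 4 minority points.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(3.0, 3.0), (3.5, 2.5), (2.5, 3.5), (3.2, 3.1)]
train = balanced_training_set(majority, minority, min_minority=6)
n_major = sum(1 for _, label in train if label == 0)
n_minor = sum(1 for _, label in train if label == 1)
print(n_major, n_minor)  # 6 6
```

Replacing the majority class with its cluster centroids keeps one representative per region of the majority distribution instead of discarding instances at random, which is the motivation the abstract gives for cluster-based under-sampling.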

Key words: Data mining, Classification, Imbalanced data, K-means clustering

HU Xiao-sheng, ZHANG Run-jing and ZHONG Yong. Two-tier Clustering for Mining Imbalanced Datasets[J]. Computer Science, 2013, 40(11): 271-275.
