Computer Science ›› 2013, Vol. 40 ›› Issue (11): 271-275.

• Artificial Intelligence •

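Two-tier Clustering for Mining Imbalanced Datasets

HU Xiao-sheng, ZHANG Run-jing and ZHONG Yong

  1. School of Electronics and Information Engineering, Foshan University, Foshan 528000; Information and Educational Technology Center, Foshan University, Foshan 528000; School of Electronics and Information Engineering, Foshan University, Foshan 528000
  • Online: 2018-11-16 Published: 2018-11-16
  • Supported by:
    This work was supported by the Foshan Science and Technology Development Special Fund (2011AA100061), the Foshan Industry-University-Research Special Fund (2012HC100272), and the Foshan Education Bureau research project on an intelligent education evaluation index system (DX20120220).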

Abstract: Classification of class-imbalanced data is an active research topic in machine learning and data mining. Most classification algorithms are biased toward the majority class, so they perform poorly on minority class instances, which are usually of greater interest. This paper proposes a two-tier clustering cascade algorithm for mining class-imbalanced data. The algorithm first builds a balanced training set by cluster-based under-sampling: K-means clustering is run on the majority class, as many cluster centroids are extracted as there are minority class instances, and these centroids are merged with all minority class instances. When the minority class is so small that the resulting training set would be too small and classification accuracy would drop, SMOTE over-sampling is combined with the cluster-based under-sampling. Next, a "K-means+C4.5" method that cascades K-means clustering with the C4.5 decision tree algorithm is applied to the balanced training set: K-means partitions the training instances into K clusters, and C4.5 builds a decision tree within each cluster, so that the K per-cluster trees refine the decision boundaries by learning the subgroups within each cluster. Experimental results show that the proposed method provides better classification performance than other approaches on both minority and majority classes, and is effective and feasible for imbalanced datasets.

Key words: Data mining, Classification, Imbalanced data, K-means clustering
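
The two tiers described in the abstract can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`balanced_training_set`, `fit_cascade` and the helpers), the toy data and all parameter values are hypothetical, the SMOTE step is a simplified interpolation, and a one-split decision stump chosen by information gain stands in for C4.5 to keep the example short.

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns the k centroids."""
    centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style synthesis: interpolate between a minority point and
    one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        nbrs = sorted((q for q in minority if q is not p),
                      key=lambda q: dist2(p, q))[:k]
        q, t = rng.choice(nbrs), rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(p, q)])
    return synthetic

def balanced_training_set(majority, minority, n_synthetic=0):
    """Tier 1: swap the majority class for as many K-means centroids as
    there are (real plus synthetic) minority instances."""
    minority = [list(p) for p in minority] + smote(minority, n_synthetic)
    centroids = kmeans(majority, len(minority))
    return centroids + minority, [0] * len(centroids) + [1] * len(minority)

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    probs = [labels.count(v) / len(labels) for v in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def fit_stump(X, y):
    """One-split tree picked by information gain. Stand-in for C4.5,
    which uses gain ratio and grows and prunes full trees."""
    fallback = max(set(y), key=y.count)
    best_gain, rule = 0.0, None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X})[:-1]:
            left = [lab for x, lab in zip(X, y) if x[f] <= thr]
            right = [lab for x, lab in zip(X, y) if x[f] > thr]
            gain = entropy(y) - (len(left) * entropy(left)
                                 + len(right) * entropy(right)) / len(y)
            if gain > best_gain:
                best_gain = gain
                rule = (f, thr, max(set(left), key=left.count),
                        max(set(right), key=right.count))
    if rule is None:  # pure or unsplittable cluster
        return lambda x: fallback
    f, thr, left_label, right_label = rule
    return lambda x: left_label if x[f] <= thr else right_label

def fit_cascade(X, y, k, seed=0):
    """Tier 2: K-means partitions the training set into k clusters and
    one tree is trained inside each cluster."""
    centroids = kmeans(X, k, seed=seed)
    default = max(set(y), key=y.count)
    trees = []
    for c in range(k):
        idx = [i for i, x in enumerate(X) if nearest(x, centroids) == c]
        if idx:
            trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
        else:
            trees.append(lambda x: default)
    # Prediction routes a point to its nearest cluster, then to that tree.
    return lambda x: trees[nearest(x, centroids)](x)

# Toy data (invented): label 0 = majority, label 1 = minority.
majority = [[0.1 * i, 0.05 * (i % 3)] for i in range(8)]
minority = [[10.0, 0.0], [10.5, 0.2], [11.0, 0.1]]
X, y = balanced_training_set(majority, minority, n_synthetic=2)
predict = fit_cascade(X, y, k=2)
print(y.count(0), y.count(1))                          # 5 5
print(all(predict(x) == lab for x, lab in zip(X, y)))  # True
```

In this sketch the eight majority points are reduced to five centroids, matching the three real plus two synthetic minority points, so the classifier in the second tier sees a balanced set. Replacing `fit_stump` with a full C4.5-style learner would leave the cascade structure unchanged.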

