Computer Science, 2019, Vol. 46, Issue (2): 50-55. doi: 10.11896/j.issn.1002-137X.2019.02.008

• Big Data and Data Science •

Under-sampling Method for Unbalanced Data Based on Centroid Space

JIN Xu¹, WANG Lei¹, SUN Guo-zi¹,²,³, LI Hua-kang¹,²,³

  1. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2. Collaborative Innovation Center for Economics Crime Investigation and Prevention Technology, Nanchang 330103, China
  3. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214000, China
  • Received: 2017-12-19 Online: 2019-02-25 Published: 2019-02-25
  • Corresponding author: LI Hua-kang (born 1982), male, Ph.D., lecturer, CCF member. His main research interests include smart cities, big data applications, and Internet security. E-mail: huakanglee@163.com
  • About the authors: JIN Xu (born 1993), male, master's student; his main research interest is natural language processing. WANG Lei (born 1994), male, master's student; his main research interests include natural language processing and sentiment analysis. SUN Guo-zi (born 1972), male, Ph.D., professor, CCF member; his main research interests include cyberspace security and electronic data forensics.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61502247, 11501302, 61502243), the China Postdoctoral Science Foundation (2016M600434), the Natural Science Foundation of Jiangsu Province (BK20140895, BK20150862), the Jiangsu Planned Projects for Postdoctoral Research Funds (1601128B), and the Open Fund of the Jiangxi Collaborative Innovation Center for Economics Crime Investigation and Prevention Technology (JXJZXTCXG015).

Abstract: Current classification algorithms perform poorly on unbalanced datasets. To address this problem, this paper combines supervised and unsupervised learning and proposes a centroid-based under-sampling method, ICIKMDS. In practical applications, some data are difficult to obtain, or different classes of data inherently differ in quantity, so the resulting datasets are unevenly distributed. Examples include the disproportion between patients and healthy people in disease detection, and between fraudulent and normal users in credit card fraud detection. The proposed method first obtains initial centroids by computing the Euclidean distances between samples, and then uses the k-means algorithm to cluster the majority-class samples. This makes the unbalanced dataset more evenly distributed and effectively improves classifier performance. On the minority class of the test set, the classification accuracy achieved with the proposed method is much higher than that achieved with random under-sampling and the SMOTE algorithm, while its accuracy on the whole test set is almost the same as that of the other algorithms.

Key words: k-means, SMOTE algorithm, Unbalanced, Under-sampling

CLC Number:

  • TP181
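
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of centroid-based under-sampling rather than the authors' ICIKMDS implementation. It assumes a max-min (farthest-point) rule for choosing initial centroids from Euclidean distances, plain k-means on the majority class, and one retained sample per cluster (the one nearest its centroid). The function names `farthest_point_init`, `assign`, `kmeans`, and `centroid_undersample` are hypothetical and do not come from the paper.

```python
import math


def farthest_point_init(points, k):
    """Pick k initial centroids with a max-min Euclidean distance rule (assumed)."""
    centroids = [points[0]]
    # Distance from every point to its nearest centroid chosen so far.
    nearest = [math.dist(p, centroids[0]) for p in points]
    while len(centroids) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centroids.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centroids


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the previous centroid if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def centroid_undersample(majority, n_keep):
    """Reduce the majority class to at most n_keep samples, one per cluster."""
    init = farthest_point_init(majority, n_keep)
    centroids, labels = kmeans(majority, init)
    kept = []
    for j, c in enumerate(centroids):
        members = [p for p, lab in zip(majority, labels) if lab == j]
        if members:
            # Keep the real sample closest to the cluster centroid.
            kept.append(min(members, key=lambda p: math.dist(p, c)))
    return kept
```

Under these assumptions, calling `centroid_undersample(majority, len(minority))` would shrink the majority class to roughly the size of the minority class before a classifier is trained. Keeping real samples instead of the centroids themselves is a design choice of this sketch, not something the abstract specifies.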