Computer Science ›› 2019, Vol. 46 ›› Issue (2): 50-55. doi: 10.11896/j.issn.1002-137X.2019.02.008

• Big Data & Data Science •

Under-sampling Method for Unbalanced Data Based on Centroid Space

JIN Xu1, WANG Lei1, SUN Guo-zi1,2,3, LI Hua-kang1,2,3   

  1. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2. Collaborative Innovation Center for Economics Crime Investigation and Prevention Technology, Nanchang 330103, China
  3. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214000, China
  • Received:2017-12-19 Online:2019-02-25 Published:2019-02-25

Abstract: Current classification algorithms perform poorly on unbalanced datasets. To address this, this paper combines supervised and unsupervised learning and proposes a centroid-based under-sampling method, ICIKMDS. In practical applications, some data are hard to obtain, or different classes of data differ in quantity, which results in an uneven data distribution. Examples include the disproportion between patients and healthy people in disease detection, and between fraudulent and normal users in credit card fraud detection. The new method handles this imbalance well. In this method, the initial centroids are obtained by computing the Euclidean distances between samples, and the k-means algorithm is then used to cluster the majority-class sample set, which makes the distribution of the unbalanced dataset more balanced and improves classifier performance. On the minority class of the test set, the proposed method gives the classifier a much higher classification accuracy than random under-sampling and the SMOTE algorithm, while its accuracy on the whole test set differs little from that of the other algorithms.
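
The clustering step described in the abstract can be sketched as follows. This is a minimal illustration, not the ICIKMDS algorithm itself: the abstract does not specify how the Euclidean distances are turned into initial centroids, so the farthest-point initialization below is an assumption, as is the choice to keep the one majority sample closest to each final centroid. All function names (`farthest_point_init`, `kmeans`, `undersample_majority`) are hypothetical.

```python
import math


def mean_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def farthest_point_init(points, k):
    """ASSUMED initialization: start from the sample nearest the overall
    mean, then repeatedly add the sample farthest (Euclidean) from all
    centroids chosen so far."""
    center = mean_of(points)
    centroids = [list(min(points, key=lambda p: math.dist(p, center)))]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def assign(points, centroids):
    """Group points by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[j].append(p)
    return clusters


def kmeans(points, k, iters=50):
    """Plain k-means; stops early once the centroids no longer change."""
    centroids = farthest_point_init(points, k)
    for _ in range(iters):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean_of(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def undersample_majority(majority, n_keep):
    """Cluster the majority class into n_keep groups and keep the sample
    closest to each centroid (ASSUMED selection rule)."""
    centroids = kmeans(majority, n_keep)
    clusters = assign(majority, centroids)
    return [
        min(c, key=lambda p: math.dist(p, centroids[j]))
        for j, c in enumerate(clusters)
        if c
    ]


# Toy data: six majority samples in two groups, reduced to two samples.
majority = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
kept = undersample_majority(majority, n_keep=2)
```

Setting `n_keep` to the number of minority samples would give a balanced training set. Keeping real samples instead of the centroids themselves is a design choice made here so that the reduced set contains only observed data.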

Key words: k-means, SMOTE algorithm, Unbalanced, Under-sampled

CLC Number: 

  • TP181