Computer Science ›› 2018, Vol. 45 ›› Issue (9): 260-265. doi: 10.11896/j.issn.1002-137X.2018.09.043

• Artificial Intelligence •

NKSMOTE Algorithm Based Classification Method for Imbalanced Dataset

WANG Li, CHEN Hong-mei   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
    Key Laboratory of Cloud Computing and Intelligent Technology, Southwest Jiaotong University, Chengdu 611756, China
  • Received: 2017-08-12 Online: 2018-09-20 Published: 2018-10-10

Abstract: In SMOTE (Synthetic Minority Over-sampling TEchnique), only the nearest neighbors within the minority class are considered when samples are synthesized, so the density of the minority class samples remains unchanged after oversampling. This paper proposes an improved algorithm, NKSMOTE (New Kernel Synthetic Minority Over-Sampling Technique), to overcome this shortcoming of SMOTE. First, a nonlinear mapping function maps the samples into a high-dimensional kernel space, and the K nearest neighbors of each minority class sample are then computed among all samples rather than among the minority class alone. In addition, different over-sampling rates are set for different minority samples, changing the imbalance ratio according to the influence that the distribution of minority class samples has on the classification performance of the algorithm. In the experiments, the proposed method was compared with several classical oversampling methods, with Decision Tree (DT), error BackPropagation (BP) and Random Forest (RF) chosen as base classifiers. Experimental results on UCI data sets show that the NKSMOTE algorithm achieves better classification performance.
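
The kernel-space neighbor search described in the abstract can be illustrated with a minimal sketch. It assumes an RBF kernel with an arbitrary gamma (the abstract does not name the kernel) and relies on the standard identity ||phi(x) - phi(y)||^2 = k(x,x) - 2k(x,y) + k(y,y), so the nonlinear mapping never has to be computed explicitly. Neighbors are searched among all samples, not only the minority class. The function names and the toy data are hypothetical, and the rule NKSMOTE uses to turn neighborhood composition into per-sample over-sampling rates is not stated in the abstract, so it is not reproduced here.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel; the kernel choice and gamma are assumptions for illustration."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between phi(x) and phi(y) computed from kernel values only."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors


def k_nearest_in_kernel_space(index, samples, k, kernel=rbf_kernel):
    """Indices of the k samples closest to samples[index] in kernel space."""
    dists = [
        (kernel_distance(samples[index], s, kernel), j)
        for j, s in enumerate(samples)
        if j != index
    ]
    dists.sort()
    return [j for _, j in dists[:k]]


def minority_neighbor_counts(samples, labels, minority_label, k):
    """For each minority sample, count minority samples among its k neighbors,
    where neighbors are drawn from the whole data set."""
    counts = {}
    for i, label in enumerate(labels):
        if label != minority_label:
            continue
        neighbors = k_nearest_in_kernel_space(i, samples, k)
        counts[i] = sum(1 for j in neighbors if labels[j] == minority_label)
    return counts


# Toy data (hypothetical): three clustered minority points, six majority
# points, and one minority point lying inside the majority region.
samples = [
    (0.0, 0.0), (0.2, 0.0), (0.0, 0.2),              # minority cluster
    (2.0, 2.0), (2.2, 2.0), (2.0, 2.2), (2.2, 2.2),  # majority
    (1.9, 1.9),                                      # isolated minority point
    (2.4, 2.0), (2.0, 2.4),                          # majority
]
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

counts = minority_neighbor_counts(samples, labels, minority_label=1, k=2)
print(counts)
```

On this toy data the three clustered minority points have only minority neighbors, while the minority point at (1.9, 1.9) has only majority neighbors. Information of this kind is what allows different minority samples to be treated differently during over-sampling.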

Key words: Classification, Imbalanced rate, Kernel space, Over-sampling, SMOTE algorithm

CLC Number: 

  • TP311