Computer Science, 2018, Vol. 45, Issue (9): 260-265. doi: 10.11896/j.issn.1002-137X.2018.09.043

• Artificial Intelligence •

NKSMOTE Algorithm Based Classification Method for Imbalanced Dataset

WANG Li, CHEN Hong-mei

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
    Key Laboratory of Cloud Computing and Intelligent Technology, Southwest Jiaotong University, Chengdu 611756, China
  • Received: 2017-08-12 Online: 2018-09-20 Published: 2018-10-10
  • Corresponding author: CHEN Hong-mei (born 1971), female, Ph.D., professor, doctoral supervisor, CCF member. Her main research interests are intelligent information processing and data mining. E-mail: hmchen@swjtu.edu.cn
  • About the author: WANG Li (born 1992), female, master's degree, CCF member. Her main research interest is data mining. E-mail: 13618022145@163.com
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61572406).

Abstract: When SMOTE (Synthetic Minority Over-sampling TEchnique) synthesizes samples, it searches for the K nearest neighbors of each minority sample only within the minority class, so the density of the minority class samples remains unchanged after oversampling. To address this problem, this paper proposes a new oversampling algorithm, NKSMOTE (New Kernel Synthetic Minority Over-Sampling Technique). The algorithm first uses a nonlinear mapping function to map the samples into a high-dimensional kernel space. It then computes, in that kernel space, the K nearest neighbors of each minority sample among all samples. Finally, it assigns different oversampling rates to different minority samples according to how strongly their distribution affects classification performance, thereby changing the imbalance ratio of the dataset. In the experiments, Decision Tree (DT), error BackPropagation (BP) and Random Forest (RF) are used as the classification algorithms, and several classical oversampling methods are compared with the proposed method in multiple groups of experiments. Experimental results on UCI datasets show that NKSMOTE achieves better classification performance.

Key words: Classification, Imbalanced rate, Kernel space, Over-sampling, SMOTE algorithm
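
The neighbor search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the kernel, so a Gaussian (RBF) kernel is assumed here, and the function names `rbf_kernel` and `kernel_knn`, the parameter `gamma`, and the returned `majority_fraction` are hypothetical. The sketch uses the kernel trick, where the squared distance in kernel space is K(x,x) - 2K(x,z) + K(z,z), to find the K nearest neighbors of each minority sample among all samples. How NKSMOTE turns the neighborhood composition into per-sample oversampling rates is not given in the abstract and is left out.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b.

    The kernel choice and gamma are assumptions made for illustration.
    """
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        - 2.0 * a @ b.T
        + np.sum(b ** 2, axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_knn(X, y, minority_label, k=5, gamma=0.5):
    """Find the k nearest neighbours of each minority sample among ALL samples.

    Distances are measured in the kernel-induced feature space.
    Returns the minority indices, the neighbour indices (one row per
    minority sample), and the fraction of majority-class neighbours.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)

    K = rbf_kernel(X[min_idx], X, gamma)
    # Squared kernel-space distance: K(x,x) - 2K(x,z) + K(z,z).
    # For the RBF kernel, K(x,x) = K(z,z) = 1.
    d2 = 2.0 - 2.0 * K
    # A sample must not be counted as its own neighbour.
    d2[np.arange(len(min_idx)), min_idx] = np.inf

    nn = np.argsort(d2, axis=1)[:, :k]
    majority_fraction = np.mean(y[nn] != minority_label, axis=1)
    return min_idx, nn, majority_fraction
```

Under these assumptions, a minority sample whose `majority_fraction` is close to 1 lies inside or near the majority class, while a value close to 0 marks a sample deep inside the minority region. This is the kind of distribution information that a per-sample oversampling rate could be based on.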

CLC Number:

  • TP311