Computer Science (计算机科学), 2019, Vol. 46, Issue 11A: 194-198.
LIU Shu-dong (刘树栋), WEI Jia-min (魏嘉敏)
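Abstract: Classification learning on class-imbalanced datasets has long been a research hotspot in data mining and machine learning. Data-level, algorithm-level, and ensemble methods are the three mainstream approaches to class-imbalanced learning. Undersampling is a commonly used data-level approach, but it tends to discard useful information contained in the majority class. This paper introduces spectral clustering into majority-class undersampling under a pairwise data representation. First, spectral clustering is applied to the majority-class samples. Then, based on the size of each cluster and the average distance between the samples in that cluster and the minority-class samples, a different number of representative samples is drawn from each cluster, and the samples within each cluster and all minority-class samples are represented in pairs. This effectively reduces the data explosion that arises when every sample is paired with every other sample, while avoiding the loss of useful information that random sampling may cause. Finally, the effectiveness of the proposed algorithm is verified on nine UCI datasets.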
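The sampling and pairing steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption: the quota weighting (cluster size divided by mean distance to the minority class), the choice of representatives (points nearest the cluster centroid), the pairing scheme, and all function names are hypothetical. The clusters are taken as given, standing in for the output of a spectral clustering step that is not shown.

```python
import math
from itertools import combinations

def mean_distance_to_minority(point, minority):
    """Average Euclidean distance from one point to all minority samples."""
    return sum(math.dist(point, m) for m in minority) / len(minority)

def allocate_quotas(clusters, minority, total):
    """Split roughly `total` picks across majority clusters.

    ASSUMED weighting (not from the paper): cluster size divided by the
    cluster's mean distance to the minority class, so larger clusters and
    clusters closer to the minority class receive more picks. Because of
    rounding and the floor of one pick per cluster, the quotas may not
    sum exactly to `total`.
    """
    weights = []
    for pts in clusters:
        avg = sum(mean_distance_to_minority(p, minority) for p in pts) / len(pts)
        weights.append(len(pts) / (avg + 1e-12))
    scale = total / sum(weights)
    return [min(len(pts), max(1, round(w * scale)))
            for pts, w in zip(clusters, weights)]

def pick_representatives(pts, k):
    """ASSUMED rule: keep the k points closest to the cluster centroid."""
    centroid = [sum(c) / len(pts) for c in zip(*pts)]
    return sorted(pts, key=lambda p: math.dist(p, centroid))[:k]

def build_pairs(reps_by_cluster, minority):
    """One possible reading of the pairwise representation: pair
    representatives within the same cluster, pair minority samples with
    each other, and pair every representative with every minority sample.
    Label 0 marks the majority class, label 1 the minority class."""
    pairs = []
    for reps in reps_by_cluster:
        pairs += [((a, b), (0, 0)) for a, b in combinations(reps, 2)]
        pairs += [((a, m), (0, 1)) for a in reps for m in minority]
    pairs += [((a, b), (1, 1)) for a, b in combinations(minority, 2)]
    return pairs

# Toy data: two majority clusters (assumed already found) and a minority class.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],  # large, near minority
    [(10.0, 10.0), (10.0, 11.0)],                      # small, far away
]
minority = [(2.0, 2.0), (3.0, 3.0)]

quotas = allocate_quotas(clusters, minority, total=3)
reps = [pick_representatives(pts, k) for pts, k in zip(clusters, quotas)]
pairs = build_pairs(reps, minority)

n_all = sum(len(pts) for pts in clusters) + len(minority)
full_pair_count = n_all * (n_all - 1) // 2  # pairing every sample with every other

print("quotas per cluster:", quotas)
print("pairs after undersampling:", len(pairs), "vs. full pairing:", full_pair_count)
```

Even on this toy data the pair count falls well below the n(n-1)/2 pairs of a full pairwise representation, which is the effect the abstract attributes to cluster-based undersampling. The sketch says nothing about the classification accuracy reported in the paper.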