Computer Science (计算机科学), 2019, Vol. 46, Issue (4): 22-27. doi: 10.11896/j.issn.1002-137X.2019.04.004
XIA Ying (夏英), LI Liu-jie (李刘杰), ZHANG Xu (张旭), BAE Hae-young (裴海英)
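Abstract: Imbalanced data degrades the performance of traditional classification algorithms and lowers the recognition rate of the minority class. Oversampling is one of the common ways to handle imbalanced datasets: it adds minority-class samples so that the minority and majority classes become balanced to some degree. Existing oversampling methods, however, tend to synthesize overlapping samples and to cause overfitting. This paper proposes WOHC (Weighted Oversampling method based on Hierarchical Clustering), a weighted oversampling method for imbalanced data. The method first applies a hierarchical clustering algorithm to the minority class, partitioning the minority samples into several clusters. It then computes a density factor for each cluster to determine that cluster's sampling rate. Finally, it determines the sampling weight of each sample from its distance to the majority-class boundary. Classification experiments on multiple datasets, using this sampling method together with the C4.5 algorithm, show that it improves the classifier's F-measure and G-mean by 7.6% and 5.8% respectively, which demonstrates the effectiveness of the method.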
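The three steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration only and is not the authors' implementation: the function names (`agglomerative`, `density_factor`, `wohc_sketch`), the use of average linkage with a fixed cluster count, the density factor (cluster size divided by mean distance to the centroid), the rule that sparser clusters receive more synthetic samples, the use of the nearest majority sample as a stand-in for the majority-class boundary, and the SMOTE-style interpolation inside each cluster.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(1, n_clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


def density_factor(points, cluster):
    """ASSUMED definition: cluster size / mean distance to the cluster centroid."""
    dim = len(points[0])
    centroid = [sum(points[i][k] for i in cluster) / len(cluster)
                for k in range(dim)]
    spread = sum(dist(points[i], centroid) for i in cluster) / len(cluster)
    return len(cluster) / (spread + 1e-9)


def wohc_sketch(minority, majority, n_clusters=2, n_new=None, seed=0):
    """Return synthetic minority samples; majority must be non-empty."""
    rng = random.Random(seed)
    if n_new is None:
        n_new = max(0, len(majority) - len(minority))
    # Step 1: partition the minority class into clusters.
    clusters = agglomerative(minority, n_clusters)
    # Step 2 (ASSUMPTION): sparser clusters get a larger share of new samples.
    inv = [1.0 / density_factor(minority, c) for c in clusters]
    total = sum(inv)
    counts = [int(n_new * s / total) for s in inv]
    counts[inv.index(max(inv))] += n_new - sum(counts)  # rounding remainder
    synthetic = []
    for cluster, count in zip(clusters, counts):
        # Step 3 (ASSUMPTION): samples nearer the closest majority sample
        # are picked as seeds more often.
        weights = [1.0 / (min(dist(minority[i], m) for m in majority) + 1e-9)
                   for i in cluster]
        for _ in range(count):
            a = minority[rng.choices(cluster, weights=weights, k=1)[0]]
            b = minority[rng.choice(cluster)]
            t = rng.random()
            # Interpolate inside the same cluster to limit class overlap.
            synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

Calling `wohc_sketch(minority, majority)` with two lists of feature vectors returns enough synthetic minority points to match the majority count by default. Because each new point is interpolated between two members of the same cluster, it stays inside that cluster's convex hull, which is one plausible way to limit the overlapping samples the abstract mentions.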