Computer Science ›› 2019, Vol. 46 ›› Issue (4): 22-27. doi: 10.11896/j.issn.1002-137X.2019.04.004

• Big Data & Data Science •

Weighted Oversampling Method Based on Hierarchical Clustering for Unbalanced Data

XIA Ying, LI Liu-jie, ZHANG Xu, BAE Hae-young

  1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2018-09-23 Online: 2019-04-15 Published: 2019-04-23
  • Corresponding author: LI Liu-jie (born 1993), male, master's student. His main research interest is data mining. E-mail: S160201032@stu.cqupt.edu.cn
  • About the authors: XIA Ying (born 1972), female, Ph.D., professor. Her main research interests are databases and data mining. ZHANG Xu (born 1981), male, Ph.D., associate professor. His main research interests are data mining and big data analysis. BAE Hae-young (born 1948), male, Ph.D., professor. His main research interests are databases and spatial information processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (41571401).

Abstract: Imbalanced data degrade the performance of traditional classification algorithms and lower the recognition rate of the minority class. Oversampling is one of the common methods for handling imbalanced datasets. Its main idea is to add minority class samples so that the numbers of minority and majority class samples become balanced to a certain extent. However, existing oversampling methods tend to synthesize overlapping samples and to cause overfitting. This paper proposes a weighted oversampling method based on hierarchical clustering for imbalanced data, named WOHC (Weighted Oversampling method based on Hierarchical Clustering). The method first uses a hierarchical clustering algorithm to divide the minority class samples into several clusters, then calculates the density factor of each cluster to determine its sampling rate, and finally determines the sampling weights according to the distance between the samples in each cluster and the boundary of the majority class. In the experiments, WOHC is used for oversampling and combined with the C4.5 algorithm to classify several datasets. The results show that the method improves the F-measure and G-mean of the classifier by 7.6% and 5.8% respectively, which indicates its effectiveness.

Key words: Hierarchical clustering, Imbalanced data, Overlapping sample, Oversampling
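
The three steps named in the abstract (cluster the minority class, set a per-cluster sampling rate from a density factor, weight samples by their distance to the majority class) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the density factor (cluster size divided by mean distance to the centroid), the rule that sparser clusters receive more synthetic samples, the weight (inverse distance to the nearest majority sample, so samples nearer the majority class are picked more often), and all function names are assumptions made for illustration only.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def agglomerative(points, n_clusters):
    """Naive centroid-linkage agglomerative clustering (cubic time, illustration only)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Merge the two clusters whose centroids are closest.
        _, i, j = min(
            (dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def allocate(clusters, n_new, eps=1e-9):
    """Split n_new synthetic samples across clusters, favouring sparse ones.

    Assumed density factor: cluster size / mean distance to the centroid.
    """
    sparsity = []
    for c in clusters:
        centre = centroid(c)
        spread = sum(dist(p, centre) for p in c) / len(c)
        sparsity.append((spread + eps) / len(c))  # 1 / density
    total = sum(sparsity)
    counts = [int(n_new * s / total) for s in sparsity]
    # Give the rounding remainder to the sparsest cluster.
    counts[sparsity.index(max(sparsity))] += n_new - sum(counts)
    return counts


def wohc_like_oversample(minority, majority, n_new, n_clusters=2, seed=0):
    """Return n_new synthetic minority points (hypothetical WOHC-style sketch)."""
    rng = random.Random(seed)
    clusters = agglomerative(minority, n_clusters)
    synthetic = []
    for cluster, count in zip(clusters, allocate(clusters, n_new)):
        # Assumed weight: inverse distance to the nearest majority sample.
        weights = [1.0 / (min(dist(p, q) for q in majority) + 1e-9) for p in cluster]
        for _ in range(count):
            base = rng.choices(cluster, weights=weights, k=1)[0]
            mate = rng.choice(cluster)  # partner from the same cluster only
            t = rng.random()
            synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic


if __name__ == "__main__":
    minority = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 12), (12, 10)]
    majority = [(5, 5), (6, 5), (5, 6), (6, 6)]
    for point in wohc_like_oversample(minority, majority, n_new=6):
        print(point)
```

Because each synthetic point is interpolated between two samples of the same cluster, it stays inside that cluster's region, which is one plausible way to limit the overlapping samples the abstract mentions. Whether weights should grow or shrink with distance to the majority boundary is not stated in the abstract, so the direction chosen here may differ from the paper.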

CLC Number:

  • TP301