Computer Science ›› 2019, Vol. 46 ›› Issue (4): 22-27. doi: 10.11896/j.issn.1002-137X.2019.04.004

• Big Data & Data Science •

Weighted Oversampling Method Based on Hierarchical Clustering for Unbalanced Data

XIA Ying, LI Liu-jie, ZHANG Xu, BAE Hae-young

  1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2018-09-23 Online: 2019-04-15 Published: 2019-04-23

Abstract: Imbalanced data degrade the performance of traditional classification algorithms to some extent, leading to a lower recognition rate for minority classes. Oversampling is one of the common methods for processing imbalanced datasets. Its main idea is to increase the number of minority class samples so that the minority and majority classes are balanced to a certain extent. Existing oversampling methods suffer from the synthesis of overlapping samples and from overfitting. This paper proposes a weighted oversampling method based on hierarchical clustering for imbalanced data, named WOHC. It first uses a hierarchical clustering algorithm to divide the minority class samples into several clusters, then calculates the density factor of each cluster to determine that cluster's sampling rate, and finally determines the sampling weights according to the distance between the minority class samples and the boundary of the majority class. In the experiments, WOHC is used for oversampling and combined with the C4.5 algorithm to perform classification on several datasets. The results show that the proposed method improves the performance of the algorithm by 7.6% and 5.8% on F-measure and G-mean respectively, which indicates the effectiveness of the method.
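
The three stages named in the abstract (hierarchical clustering of the minority class, a per-cluster sampling rate derived from a density factor, and distance-based sampling weights) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the function name `wohc_sketch`, the density factor (cluster size divided by mean pairwise distance), the use of the nearest majority sample as a stand-in for the majority boundary, the choice to favor seeds farther from that boundary, and the SMOTE-style interpolation inside a cluster. It should not be read as the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist


def wohc_sketch(X_min, X_maj, n_clusters=3, n_new=None, seed=0):
    """Illustrative WOHC-style oversampling; all formulas are assumptions."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    if n_new is None:
        # Default: generate enough samples to match the majority class size.
        n_new = max(len(X_maj) - len(X_min), 0)

    # Stage 1: agglomerative (hierarchical) clustering of the minority class.
    Z = linkage(X_min, method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    cluster_ids = np.unique(labels)

    # Stage 2 (assumed formula): density = size / mean pairwise distance.
    # Sparser clusters receive a larger share of the new samples.
    sparsity = []
    for c in cluster_ids:
        pts = X_min[labels == c]
        n = len(pts)
        mean_d = cdist(pts, pts).sum() / (n * (n - 1)) if n > 1 else 0.0
        sparsity.append((mean_d + 1e-12) / n)
    share = np.array(sparsity) / np.sum(sparsity)
    counts = np.floor(share * n_new).astype(int)
    counts[np.argmax(share)] += n_new - counts.sum()  # hand out the remainder

    # Stage 3 (assumed proxy): distance to the nearest majority sample stands
    # in for distance to the majority boundary; farther seeds are favored.
    d_maj = cdist(X_min, X_maj).min(axis=1)
    synthetic = []
    for c, k in zip(cluster_ids, counts):
        idx = np.flatnonzero(labels == c)
        w = d_maj[idx] + 1e-12
        seeds = rng.choice(idx, size=k, p=w / w.sum())
        mates = rng.choice(idx, size=k)  # partner drawn from the same cluster
        gap = rng.random((k, 1))
        # SMOTE-style interpolation between a seed and its partner.
        synthetic.append(X_min[seeds] + gap * (X_min[mates] - X_min[seeds]))
    return np.vstack(synthetic)
```

In use, the returned array would be appended to the minority class before training a classifier such as C4.5. Interpolating only between samples of the same cluster keeps synthetic points inside regions already occupied by the minority class, which is one plausible way to limit overlap with the majority class.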

Key words: Hierarchical clustering, Imbalanced data, Overlapping sample, Oversampling

CLC Number: 

  • TP301