Computer Science, 2019, Vol. 46, Issue (2): 50-55. doi: 10.11896/j.issn.1002-137X.2019.02.008
金旭1, 王磊1, 孙国梓1,2,3, 李华康1,2,3
JIN Xu1, WANG Lei1, SUN Guo-zi1,2,3, LI Hua-kang1,2,3
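Abstract: Current classification algorithms perform poorly on imbalanced datasets. To address this, the paper combines supervised and unsupervised learning and proposes a centroid-based undersampling method, ICIKMDS. In real-world applications, some data are hard to obtain, or different classes of data inherently differ in quantity, so the resulting datasets are unevenly distributed. Examples include the ratio of patients to healthy people in disease detection and the ratio of fraudulent to normal users in credit card fraud detection. The proposed method first derives initial centroids from the Euclidean distances between samples, and then runs the k-means algorithm on the majority-class sample set. This makes the distribution of the imbalanced dataset more balanced and improves the performance of the classifier. With the proposed method, the classifier's accuracy on the minority class of the test set is far higher than with random undersampling or the SMOTE algorithm, while its accuracy on the whole test set is almost the same as that of the other algorithms.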
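The two steps named in the abstract (distance-based initial centroids, then k-means on the majority class) can be sketched as follows. This is a minimal illustration and not the authors' ICIKMDS implementation: the abstract does not say how the initial centroids are chosen from the Euclidean distances, so the farthest-point heuristic below is an assumption, as are the function names, the choice of one cluster per minority sample, and the use of the cluster centroids themselves as the retained majority samples.

```python
import numpy as np

def farthest_point_init(X, k):
    """Pick k initial centroids from X by Euclidean distance.
    Assumed heuristic; the abstract does not specify the exact rule."""
    # Start from the sample closest to the overall mean.
    first = int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    # dist[i] = distance from sample i to its nearest chosen centroid.
    dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))  # farthest sample from all chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].astype(float)

def kmeans(X, centroids, n_iter=100):
    """Plain Lloyd's k-means starting from the given centroids."""
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def centroid_undersample(X, y, majority_label):
    """Replace the majority class with k-means centroids (hypothetical helper).
    Assumes the majority class has at least as many samples as the rest."""
    X_maj = X[y == majority_label]
    X_min, y_min = X[y != majority_label], y[y != majority_label]
    k = len(X_min)  # assumption: one cluster per minority sample
    centroids = kmeans(X_maj, farthest_point_init(X_maj, k))
    X_bal = np.vstack([centroids, X_min])
    y_bal = np.concatenate([np.full(k, majority_label), y_min])
    return X_bal, y_bal
```

For example, calling `centroid_undersample(X, y, majority_label=0)` on a dataset with 200 samples of class 0 and 20 of class 1 would return 20 centroids labelled 0 followed by the 20 original minority samples, which could then be passed to any classifier.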