Computer Science, 2020, Vol. 47, Issue (6A): 488-493. doi: 10.11896/jsjkx.190600132

• Database & Big Data & Data Science •

New Associative Classification Algorithm for Imbalanced Data

CUI Wei, JIA Xiao-lin, FAN Shuai-shuai and ZHU Xiao-yan

  1. School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
  • Published: 2020-07-07
  • Corresponding author: ZHU Xiao-yan (zhu.xy@xjtu.edu.cn)
  • About author: CUI Wei (wwwcuiwei@stu.xjtu.edu.cn), born in 1994, postgraduate. His main research interests include machine learning and data mining.
    ZHU Xiao-yan, born in 1983, Ph.D., associate professor, is a member of China Computer Federation. Her main research interests include machine learning and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61402355, 61502378).

Abstract: Rule-based classification algorithms offer good classification performance and strong interpretability, and have been widely used. However, existing rule-based classification algorithms do not account for imbalanced data, which degrades their classification performance on such data. This paper proposes ACI, a new associative classification algorithm for imbalanced data. First, all association rules are generated. Then, the rules are pruned with an imbalanced rule pruning method. Finally, the remaining rules are stored in a CR tree and used to classify new instances. Experimental results on 27 public data sets show that the proposed algorithm achieves better classification performance on imbalanced data sets than the baseline algorithms.

Key words: Association rule, Classification, Imbalanced data
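
The three steps named in the abstract (rule generation, rule pruning, classification of new instances) can be illustrated with a minimal Python sketch. It is not the authors' ACI implementation: the abstract does not specify the imbalanced pruning criterion or the CR tree layout, so this sketch assumes a lift-plus-confidence filter for pruning and a sorted flat list in place of the CR tree. All function names, thresholds, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(rows, labels, min_support=1, max_len=2):
    """Enumerate class association rules (itemset -> label) by brute force.

    rows: list of frozensets of (attribute, value) items; labels: class labels.
    """
    n = len(rows)
    class_count = Counter(labels)
    body_count = Counter()  # rows containing the itemset
    rule_count = Counter()  # rows containing the itemset and carrying the label
    for items, label in zip(rows, labels):
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                body = frozenset(combo)
                body_count[body] += 1
                rule_count[(body, label)] += 1
    rules = []
    for (body, label), hits in rule_count.items():
        if hits < min_support:
            continue
        confidence = hits / body_count[body]
        # Lift divides confidence by the class prior, so a rule for a rare
        # class is not penalised merely because the class is rare.
        lift = confidence / (class_count[label] / n)
        rules.append({"items": body, "label": label, "support": hits,
                      "confidence": confidence, "lift": lift})
    return rules


def prune_rules(rules, min_lift=1.0, min_confidence=0.5):
    """Assumed pruning criterion (not taken from the paper)."""
    kept = [r for r in rules
            if r["lift"] > min_lift and r["confidence"] >= min_confidence]
    # Strongest rules first; shorter rule bodies break ties.
    kept.sort(key=lambda r: (-r["lift"], -r["confidence"], len(r["items"])))
    return kept


def classify(rules, instance, default):
    """Return the label of the first (strongest) rule whose body matches."""
    for rule in rules:
        if rule["items"] <= instance:
            return rule["label"]
    return default


# Toy imbalanced data set: five "neg" rows and one "pos" row.
rows = [
    frozenset({("color", "red"), ("size", "big")}),
    frozenset({("color", "red"), ("size", "small")}),
    frozenset({("color", "blue"), ("size", "big")}),
    frozenset({("color", "blue"), ("size", "small")}),
    frozenset({("color", "blue"), ("size", "small")}),
    frozenset({("color", "green"), ("size", "small")}),
]
labels = ["neg", "neg", "neg", "neg", "neg", "pos"]

rules = prune_rules(mine_class_rules(rows, labels))
new_instance = frozenset({("color", "green"), ("size", "small")})
print(classify(rules, new_instance, default="neg"))
```

Ranking by lift rather than raw support is one common way to keep minority-class rules from being crowded out; a tree-based rule store such as the CR tree mentioned in the abstract would mainly change how matching rules are looked up, not the overall flow shown here.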

CLC Number:

  • TP312