Computer Science, 2013, Vol. 40, Issue (4): 131-135.

• Information Security •

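Imbalanced Data Classification Method and its Application Research for Intrusion Detection

JIANG Jie, WANG Zhuo-fang, GONG Rong-sheng, CHEN Tie-ming

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023; College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023; Intelligent Systems Laboratory, University of Cincinnati, Cincinnati 45221, USA; College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023
  • Online: 2018-11-16  Published: 2018-11-16
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61103044), the Natural Science Foundation of Zhejiang Province (Y1110567), and the Science and Technology Department of Zhejiang Province Program (2010C31126, 2011C21046).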

Abstract: Traditional classification algorithms, when applied directly to imbalanced datasets, often yield low classification accuracy for the minority class. This paper proposes a classification method for imbalanced data based on the Kolmogorov-Smirnov (K-S) statistic to improve recognition of the minority class. First, the K-S statistic is used to evaluate the relationship between the class label and each feature, so that redundant features can be removed. Then a K-S based decision tree is built to segment the training data into subsets, which adjusts the degree of imbalance within each subset. Finally, two-way resampling, forward and backward, is applied to the segmented data before classification learning. The assumptions required by the K-S statistic are easily satisfied, so the method is efficient and widely applicable. Comparative experiments on the KDD99 intrusion detection data show that, for imbalanced datasets, the method achieves high classification accuracy on both the majority and the minority classes.

Key words: Imbalanced data, K-S statistic, Logistic regression, Intrusion detection
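
As a rough illustration of the first two steps described in the abstract, the Python sketch below computes the two-sample K-S statistic between a feature's values in the two classes, ranks features by that score, and splits one data segment at the threshold where the class-conditional distributions differ most. The function names (`ks_statistic`, `rank_features`, `split_segment`), the 0/1 label encoding, and the toy data are assumptions made for illustration. The abstract does not state the paper's exact feature-removal criterion, tree-growing rule, or resampling scheme, so this is a sketch of the general idea and not the authors' implementation.

```python
from bisect import bisect_right


def ks_statistic(values, labels):
    """Two-sample K-S statistic of one feature between class 1 and class 0.

    Returns (ks, threshold): the largest absolute gap between the two
    empirical CDFs and the feature value at which that gap occurs.
    """
    pos = sorted(v for v, y in zip(values, labels) if y == 1)
    neg = sorted(v for v, y in zip(values, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best_gap, best_t = -1.0, None
    for t in sorted(set(values)):
        # Empirical CDF at t: fraction of each class with value <= t.
        f_pos = bisect_right(pos, t) / len(pos)
        f_neg = bisect_right(neg, t) / len(neg)
        gap = abs(f_pos - f_neg)
        if gap > best_gap:
            best_gap, best_t = gap, t
    return best_gap, best_t


def rank_features(rows, labels, feature_names):
    """Score every feature column by its K-S statistic, highest first."""
    scores = []
    for j, name in enumerate(feature_names):
        column = [row[j] for row in rows]
        scores.append((name, ks_statistic(column, labels)[0]))
    return sorted(scores, key=lambda item: item[1], reverse=True)


def split_segment(rows, labels, j):
    """Split one segment on feature j at its K-S threshold (one tree node)."""
    _, t = ks_statistic([row[j] for row in rows], labels)
    left = [(row, y) for row, y in zip(rows, labels) if row[j] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[j] > t]
    return t, left, right


# Hypothetical toy data: two features, class 1 is the minority class.
toy_rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0), (8.0, 3.0), (9.0, 6.0)]
toy_labels = [0, 0, 0, 0, 1, 1]

if __name__ == "__main__":
    print(rank_features(toy_rows, toy_labels, ["f0", "f1"]))
    threshold, left, right = split_segment(toy_rows, toy_labels, 0)
    print(threshold, len(left), len(right))
```

Under this reading, a feature with a low K-S score separates the classes poorly and is a candidate for removal, while each split yields segments whose class ratios differ from the original data and could then be rebalanced separately.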

