Computer Science, 2021, Vol. 48, Issue (7): 62-69. doi: 10.11896/jsjkx.200600022
Special topic: Artificial Intelligence Security
张仁杰, 陈伟, 杭梦鑫, 吴礼发
ZHANG Ren-jie, CHEN Wei, HANG Meng-xin, WU Li-fa
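Abstract: With the rapid development of machine learning, more and more machine learning algorithms are being used to detect and analyze attack traffic. However, attack traffic usually accounts for only a very small fraction of network traffic, so the training set suffers from an imbalance between positive and negative samples, which degrades model training. To address this imbalance, this paper proposes a sample generation method based on the variational auto-encoder. Its core idea is that, when expanding the minority class, it does not expand all minority samples; instead, it analyzes them and expands only the minority boundary samples that are most likely to confuse the machine learning model. First, the KNN algorithm is used to select the minority-class samples that lie closest to majority-class samples. Second, the DBSCAN algorithm clusters the samples selected by KNN into one or more sub-clusters. Then, a variational auto-encoder network model is designed to learn from and expand the minority-class samples in each sub-cluster produced by DBSCAN, and the generated samples are added to the original samples to build a new training set. Finally, the new training set is used to train a decision tree classifier for abnormal traffic detection. Recall and F1 score are used as evaluation metrics, and comparative experiments are run with training sets built from the original samples, SMOTE-generated samples, samples generated by improved SMOTE methods, and samples generated by the proposed method. The experimental results show that, across four anomaly types, the decision tree classifier trained on the training set built by the proposed algorithm improves in both recall and F1 score, with the F1 score improving by up to 20.9% over the original samples and the SMOTE method.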
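The first two stages described in the abstract (KNN-based selection of boundary minority samples, then DBSCAN clustering of the selected samples) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `boundary_minority` and `dbscan`, the parameters `k`, `eps` and `min_pts`, the toy 2-D data, and the selection rule (a minority sample counts as a boundary sample if at least one of its `k` nearest neighbours is a majority sample) are all assumptions made for illustration. The variational auto-encoder and decision tree stages are omitted.

```python
import math

def boundary_minority(minority, majority, k=3):
    """Keep minority points that have a majority point among their k nearest neighbours.

    Assumed selection rule; the abstract only says KNN picks the minority
    samples closest to the majority class.
    """
    selected = []
    for i, p in enumerate(minority):
        # Distances to every other minority point and to every majority point.
        neighbours = [(math.dist(p, q), "min") for j, q in enumerate(minority) if j != i]
        neighbours += [(math.dist(p, q), "maj") for q in majority]
        neighbours.sort(key=lambda t: t[0])
        if any(label == "maj" for _, label in neighbours[:k]):
            selected.append(p)
    return selected

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: return one cluster id per point, with -1 marking noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neigh = region(i)
        if len(neigh) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point becomes border
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            neigh_j = region(j)
            if len(neigh_j) >= min_pts:
                queue.extend(neigh_j)  # j is also a core point: keep expanding
    return labels

# Toy data (hypothetical): a majority block, two minority groups next to it,
# and one minority group far away from the class boundary.
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
minority = [(1.5, 1.0), (1.6, 1.1), (1.0, -0.5), (1.1, -0.6),
            (9.0, 9.0), (9.1, 9.0), (9.0, 9.1), (9.1, 9.1)]

boundary = boundary_minority(minority, majority, k=3)
labels = dbscan(boundary, eps=0.5, min_pts=2)
print(boundary)  # -> [(1.5, 1.0), (1.6, 1.1), (1.0, -0.5), (1.1, -0.6)]
print(labels)    # -> [0, 0, 1, 1]
```

In this toy run the distant minority group is left out because all of its nearest neighbours are minority points, and the four boundary points fall into two sub-clusters. In the pipeline the abstract describes, each such sub-cluster would then be passed to the generative model for expansion.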