Computer Science, 2021, Vol. 48, Issue (7): 62-69. doi: 10.11896/jsjkx.200600022

Special topic: Artificial Intelligence Security

Detection of Abnormal Flow of Imbalanced Samples Based on Variational Autoencoder

ZHANG Ren-jie, CHEN Wei, HANG Meng-xin, WU Li-fa

  1. School of Computer Science, School of Software, School of Cyberspace Security, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2020-06-02 Revised: 2020-08-23 Online: 2021-07-15 Published: 2021-07-02
  • Corresponding author: CHEN Wei (chenwei@njupt.edu.cn)
  • About author: ZHANG Ren-jie, born in 1995, M.S. candidate, is a student member of China Computer Federation. His main research interests include network security and machine learning. (zrj9582346@163.com)
    CHEN Wei, born in 1979, Ph.D., professor, is a member of China Computer Federation. His main research interests include wireless network security and mobile Internet security.
  • Supported by:
    National Key Research and Development Project (2019YFB2101704).

Abstract: With the rapid development of machine learning, more and more machine learning algorithms are being used to detect and analyze attack traffic. However, attack traffic usually accounts for only a very small portion of network traffic, so the positive and negative samples in the training set are imbalanced, which degrades model training. To address this problem, this paper proposes an imbalanced-sample generation method based on the variational autoencoder (VAE). The core idea is that, when expanding the minority samples, not all of them are expanded; instead, the minority samples are analyzed, and only the minority boundary samples that are most likely to confuse the machine learning model are expanded. First, the KNN algorithm is used to select the minority-class samples that are closest to the majority-class samples. Second, the DBSCAN algorithm clusters the samples selected by KNN into one or more sub-clusters. Then, a VAE network model is designed to learn and expand the minority-class samples in each sub-cluster identified by DBSCAN, and the expanded samples are added to the original samples to build a new training set. Finally, the new training set is used to train a decision tree classifier to detect abnormal traffic. Recall and F1 score are selected as the evaluation metrics, and comparative experiments are conducted with training sets built from the original samples, SMOTE-generated samples, samples generated by improved SMOTE methods, and samples generated by the proposed method. The experimental results show that, for the four types of anomalies, the decision tree classifier trained on the training set constructed by the proposed algorithm improves in both recall and F1 score, with the F1 score improving by up to 20.9% over the original samples and the SMOTE method.

Key words: Abnormal flow, DBSCAN, Imbalanced sample, KNN, Oversampling, Variational auto-encoder
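
The first two steps described in the abstract (selecting boundary minority samples with KNN, then grouping them with DBSCAN) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `select_borderline` and `dbscan`, the rule "at least `min_majority` majority-class points among the `k` nearest neighbours", and all parameter values and toy data are assumptions made for illustration; the abstract does not state them. The VAE step is omitted here; each resulting sub-cluster would be handed to a generative model for expansion.

```python
import math

def select_borderline(minority, majority, k=3, min_majority=1):
    """Keep minority points with majority-class points among their k nearest
    neighbours (an assumed selection rule, not taken from the paper)."""
    points = [(p, 0) for p in minority] + [(q, 1) for q in majority]
    borderline = []
    for i, p in enumerate(minority):
        # Distances from p to every other point; index i is p itself.
        nearest = sorted(
            (math.dist(p, q), label)
            for j, (q, label) in enumerate(points) if j != i
        )[:k]
        if sum(label for _, label in nearest) >= min_majority:
            borderline.append(p)
    return borderline

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one cluster label per point, -1 marks noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: expand
                queue.extend(j_neighbours)
    return labels

# Toy 2-D data (hypothetical): two minority groups near majority points,
# and one minority group far away from any majority point.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (4.0, 4.0), (4.1, 4.0),
            (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1)]
majority = [(0.3, 0.3), (0.4, 0.3), (4.3, 4.3), (4.4, 4.3)]

borderline = select_borderline(minority, majority, k=3)
labels = dbscan(borderline, eps=0.5, min_pts=2)

# Group the boundary samples into sub-clusters, dropping noise points.
clusters = {}
for p, label in zip(borderline, labels):
    if label != -1:
        clusters.setdefault(label, []).append(p)
```

In this toy setting the far-away minority group is left untouched, and the remaining boundary samples fall into two sub-clusters, which mirrors the idea of expanding only the samples most likely to be confused with the majority class.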

CLC Number: TP391
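
The abstract evaluates with recall and F1 score, which suit imbalanced data better than plain accuracy because a classifier that always predicts the majority class can still score high accuracy. The sketch below shows how the two metrics are computed from predictions. The helper name `recall_and_f1` and the toy labels are hypothetical and are not results from the paper.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Return (recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against empty denominators (no positives present or predicted).
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1

# Toy labels (hypothetical): 1 = attack traffic, 0 = normal traffic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
recall, f1 = recall_and_f1(y_true, y_pred)
print(f"recall={recall:.3f}, F1={f1:.3f}")
```

Here accuracy would be 0.7 even though half of the attack samples are missed, which is why recall and F1 on the minority class are the more informative figures.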