Webshell File Detection Method Based on TF-IDF

Computer Science, 2020, Vol. 47, Issue (11A): 363-367. doi: 10.11896/jsjkx.200100064

ZHAO Rui-jie¹, SHI Yong¹,², ZHANG Han¹, LONG Jun¹, XUE Zhi¹,²

1 School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2 Shanghai Information Security Integrated Management Technology Laboratory, Shanghai 200240, China

Online: 2020-11-15 Published: 2020-11-17
Corresponding author: XUE Zhi (zxue@sjtu.edu.cn)
About the authors: ZHAO Rui-jie (ruijiezhao@sjtu.edu.cn), born in 1995, postgraduate. His main research interests include network security, machine learning and data mining.
XUE Zhi, born in 1971, Ph.D., professor, Ph.D. supervisor. His main research interests include cyber security, communication technology and machine learning.
Supported by:
This work was supported by the National Key R&D Program of China (2017YFB0803200).

Abstract

Abstract: With the rapid development of the Internet, cyber attacks are becoming increasingly frequent. Webshells are a common means of attack, and traditional detection methods can no longer cope with complex and flexible Webshell variants. To address this problem, a Webshell file detection method based on TF-IDF is proposed. The system first classifies Webshell files by type and applies the corresponding preprocessing and transcoding to each type, which reduces the impact of obfuscation and interference techniques on detection. It then builds a bag-of-words model and uses the TF-IDF algorithm to extract weighted features. Finally, it trains the detection model with the XGBoost algorithm. A 10-fold cross-validation comparison against traditional machine learning algorithms shows that the model combining TF-IDF preprocessing with XGBoost improves on traditional detection methods in accuracy, precision and recall, and is more robust and generalizes better. Detection accuracy reaches 98.09% for PHP files and 97.09% for JSP files.

Key words: Cross validation, Feature extraction, Multi-layer perceptron, Random forest, Support vector machine, TF-IDF, Webshell detection, XGBoost algorithm
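
The bag-of-words and TF-IDF weighting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the unsmoothed formula tf × log(N/df), the function names `tokenize` and `tfidf`, and the three toy PHP snippets are all assumptions made for illustration. The XGBoost training and cross-validation steps are not shown.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifier-like runs of letters, digits and underscores.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Split source text into lowercase identifier-like tokens."""
    return [token.lower() for token in TOKEN_RE.findall(source)]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(N / df), where df counts the documents containing the term
    (an unsmoothed variant, chosen here only for readability).
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: each document contributes at most 1 per term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty files
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for this example (not data from the paper).
samples = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php echo 'hello'; ?>",
    "<?php echo date('Y'); ?>",
]
vectors = tfidf(samples)
```

In this toy corpus the token `php` appears in every file, so its weight is zero, while `eval` appears in only one file and receives the largest weight of that file's shared vocabulary. This is the property that makes TF-IDF useful for separating rare, suspicious tokens from common boilerplate. The resulting per-file weight dictionaries would then be turned into fixed-length feature vectors and passed to a classifier.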

CLC Number: TP393

ZHAO Rui-jie, SHI Yong, ZHANG Han, LONG Jun, XUE Zhi. Webshell File Detection Method Based on TF-IDF[J]. Computer Science, 2020, 47(11A): 363-367. https://doi.org/10.11896/jsjkx.200100064

