Computer Science, 2020, Vol. 47, Issue (11A): 363-367. doi: 10.11896/jsjkx.200100064

• Information Security •

Webshell File Detection Method Based on TF-IDF

ZHAO Rui-jie1, SHI Yong1,2, ZHANG Han1, LONG Jun1, XUE Zhi1,2

  1. 1 School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
    2 Shanghai Information Security Integrated Management Technology Laboratory, Shanghai 200240, China
  • Online: 2020-11-15 Published: 2020-11-17
  • Corresponding author: XUE Zhi (zxue@sjtu.edu.cn)
  • About author: ZHAO Rui-jie (ruijiezhao@sjtu.edu.cn), born in 1995, postgraduate. His main research interests include network security, machine learning and data mining.
    XUE Zhi, born in 1971, Ph.D, professor, Ph.D supervisor. His main research interests include cyber security, communication technology and machine learning.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2017YFB0803200).

Abstract: With the rapid development of the Internet, cyber attacks are becoming more frequent. Webshells are a common means of attack, and traditional detection methods cannot cope with complex, flexible Webshell variants. To address this problem, a Webshell file detection method based on TF-IDF is proposed. The system first classifies Webshell files by type and applies the corresponding preprocessing and transcoding to each type, reducing the impact of obfuscation techniques on detection. It then builds a bag-of-words model and uses the TF-IDF algorithm to weight and extract the relevant features. Finally, it trains the detection model with the XGBoost algorithm. Comparison tests against traditional machine learning algorithms under 10-fold cross-validation show that the detection model combining TF-IDF preprocessing with XGBoost improves on traditional detection methods in accuracy, precision and recall, and is more robust and generalizes better. Detection accuracy reaches 98.09% for PHP files and 97.09% for JSP files.

Key words: Cross validation, Feature extraction, Multi-layer perceptron, Random forest, Support vector machine, TF-IDF, Webshell detection, XGBoost algorithm

CLC Number:

  • TP393
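
The feature-weighting step described in the abstract (bag of words followed by TF-IDF) can be illustrated with a minimal sketch. It assumes a simple regex tokenizer and the textbook weight tf × log(N / df); the abstract does not state which TF-IDF variant, tokenizer, or preprocessing the authors used, and the three PHP snippets are hypothetical toy samples, not data from the paper.

```python
import math
import re
from collections import Counter


def tokenize(source: str) -> list[str]:
    """Split source text into lowercase identifier-like tokens.

    This tokenizer is an assumption for illustration only.
    """
    return re.findall(r"[a-z_][a-z0-9_]*", source.lower())


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {token: weight} dict per document, weight = tf * log(N / df)."""
    # Bag-of-words model: token counts per document.
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(bags)

    # Document frequency: in how many documents each token occurs.
    df = Counter()
    for bag in bags:
        df.update(bag.keys())

    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in bag.items()
        })
    return weights


# Hypothetical toy samples, not taken from the paper's dataset.
samples = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php echo 'hello world'; ?>",
    "<?php echo date('Y'); ?>",
]
weights = tfidf(samples)
```

In this toy corpus the token `php` occurs in every sample, so its weight is zero, while `eval` occurs in only one sample and receives a higher weight than the more common `echo`. Weight vectors of this kind would then serve as input features for a classifier such as XGBoost.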