Computer Science, 2020, Vol. 47, Issue (11A): 363-367. doi: 10.11896/jsjkx.200100064

• Information Security •

Webshell File Detection Method Based on TF-IDF

ZHAO Rui-jie1, SHI Yong1,2, ZHANG Han1, LONG Jun1, XUE Zhi1,2   

  1 School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  2 Shanghai Information Security Integrated Management Technology Laboratory, Shanghai 200240, China
  • Online: 2020-11-15 Published: 2020-11-17
  • About author: ZHAO Rui-jie, born in 1995, postgraduate. His main research interests include network security, machine learning and data mining.
    XUE Zhi, born in 1971, Ph.D, professor, Ph.D supervisor. His main research interests include cyber security, communication technology and machine learning.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2017YFB0803200).

Abstract: With the rapid development of the Internet, cyber attacks are becoming more frequent. Webshells are a common attack method, and traditional detection methods cannot cope with complex and flexible Webshell variants. To address this problem, a Webshell detection method based on TF-IDF is proposed. First, the system classifies Webshell files by type and transcodes each type accordingly, which reduces the impact of obfuscation and interference techniques on detection. It then builds a bag-of-words model and uses the TF-IDF algorithm to weight and extract the relevant features. Finally, it uses the XGBoost algorithm to train the detection model. Compared with traditional machine learning algorithms and traditional detection methods, the Webshell detection model based on TF-IDF and XGBoost achieves higher accuracy and has stronger robustness and generalization capability. The detection accuracy of the XGBoost algorithm can reach 98.09% for PHP files and 97.09% for JSP files.
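The bag-of-words and TF-IDF weighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus `DOCS`, the `tokenize` helper, and the plain `log(N / df)` form of IDF are all assumptions chosen for illustration, and the paper's transcoding step and XGBoost training are not shown.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: tiny stand-ins for script files (not data from the paper).
DOCS = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php echo 'hello world'; ?>",
    "<?php echo date('Y-m-d'); ?>",
]


def tokenize(text):
    """Split source text into lowercase word-like tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def tfidf(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    # Bag of words: term counts per document.
    bags = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*bags))
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for b in bags if t in b) for t in vocab}
    # Textbook IDF, log(N / df); libraries often use smoothed variants.
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        # Term frequency (count / document length) scaled by IDF.
        vectors.append([bag[t] / total * idf[t] for t in vocab])
    return vocab, vectors


vocab, vectors = tfidf(DOCS)
for doc, vec in zip(DOCS, vectors):
    top = max(range(len(vocab)), key=lambda i: vec[i])
    print(f"{doc!r}: highest-weighted term = {vocab[top]!r}")
```

In this sketch, a token that appears in every file (such as `php`) receives weight zero, while a token confined to one file (such as `eval`) is weighted up. The resulting vectors are the kind of feature matrix that could then be passed to a classifier such as XGBoost.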

Key words: Cross validation, Feature extraction, Multi-layer perceptron, Random forest, Support vector machine, TF-IDF, Webshell detection, XGBoost algorithm

CLC Number: 

  • TP393