Computer Science ›› 2022, Vol. 49 ›› Issue (8): 336-343. doi: 10.11896/jsjkx.210900203

• Information Security •

Study on Malware Classification Based on N-Gram Static Analysis Technology

ZHANG Guang-hua1,2, GAO Tian-jiao1, CHEN Zhen-guo3, YU Nai-wen1   

    1 School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
    2 State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an 710071, China
    3 Hebei IoT Monitoring Engineering Technology Research Center, North China Institute of Science and Technology, Langfang, Hebei 065201, China
  • Received: 2021-09-24 Revised: 2022-03-11 Published: 2022-08-02
  • About author: ZHANG Guang-hua, born in 1979, Ph.D., professor and master's supervisor, is a senior member of the China Computer Federation. His main research interests include network and information security.
    YU Nai-wen, born in 1983, master's degree, assistant researcher. Her main research interests include computer network management.
  • Supported by:
    National Key Research and Development Program of China (2018YFB0804701), National Natural Science Foundation of China (62072239) and Science and Technology Program of Hebei Science and Technology Department (20377725D).

Abstract: To address the low accuracy of malware classification, this paper proposes a malware classification method based on N-Gram static analysis. First, the N-Gram method is used to extract byte sequences of length 2 from the malware samples. Second, using the extracted features, KNN, logistic regression, random forest and XGBoost classifiers are trained as machine-learning-based malware classification models. Third, the confusion matrix and the logarithmic loss function are used to evaluate the models. Finally, the models are trained and tested on the Kaggle malware data set. Experimental results show that the XGBoost and random forest models reach accuracy rates of 98.43% and 97.93%, with Log Loss values of 0.022240 and 0.026946, respectively. Compared with existing methods, the proposed method classifies malware more accurately and helps protect computer systems from malware attacks.
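
The feature extraction and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each sample is already available as a raw `bytes` object and that a classifier has produced per-class probabilities. The function names (`byte_bigram_counts`, `bigram_vector`, `log_loss`, `confusion_matrix`) and the 65,536-dimensional normalized layout are hypothetical choices made here for illustration.

```python
import math
from collections import Counter


def byte_bigram_counts(data):
    """Count overlapping byte 2-grams (pairs of adjacent byte values)."""
    return Counter(zip(data, data[1:]))


def bigram_vector(data):
    """Return a normalized frequency vector of length 256 * 256.

    Assumption: index 256 * first_byte + second_byte holds the relative
    frequency of that 2-gram; the paper may use a different layout.
    """
    counts = byte_bigram_counts(data)
    total = max(sum(counts.values()), 1)  # avoid division by zero
    vec = [0.0] * (256 * 256)
    for (first, second), count in counts.items():
        vec[256 * first + second] = count / total
    return vec


def log_loss(y_true, probs, eps=1e-15):
    """Multiclass logarithmic loss: mean of -log p(true class)."""
    total = 0.0
    for label, p in zip(y_true, probs):
        clipped = min(max(p[label], eps), 1.0 - eps)  # keep log finite
        total -= math.log(clipped)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


if __name__ == "__main__":
    sample = b"\x4d\x5a\x90\x00\x4d\x5a"  # toy byte string, not real malware
    print(byte_bigram_counts(sample).most_common(1))
    print(log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]]))
    print(confusion_matrix([0, 1, 1], [0, 1, 0], 2))
```

Vectors of this kind could then be passed to any of the classifiers listed in the abstract. A lower log loss means the predicted probabilities put more weight on the correct class.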

Key words: Machine learning, Malware, N-Gram, Static analysis

CLC Number: 

  • TP309