Computer Science (计算机科学) ›› 2022, Vol. 49 ›› Issue (8): 336-343. doi: 10.11896/jsjkx.210900203

• Information Security •

Study on Malware Classification Based on N-Gram Static Analysis Technology

ZHANG Guang-hua1,2, GAO Tian-jiao1, CHEN Zhen-guo3, YU Nai-wen1

  1 School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
  2 State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China
  3 Hebei IoT Monitoring Engineering Technology Research Center, North China Institute of Science and Technology, Langfang, Hebei 065201, China
  • Received: 2021-09-24 Revised: 2022-03-11 Published: 2022-08-02
  • Corresponding author: YU Nai-wen (yunaiwen@hebust.edu.cn)
  • About author: ZHANG Guang-hua (zhanggh@hebust.edu.cn), born in 1979, Ph.D., professor, master's supervisor, is a senior member of the China Computer Federation. His main research interests include network and information security.
    YU Nai-wen, born in 1983, master's degree, assistant researcher. Her main research interests include computer network management.
  • Supported by:
    National Key Research and Development Program of China (2018YFB0804701), National Natural Science Foundation of China (62072239), and Science and Technology Program of Hebei Science and Technology Department (20377725D).

Abstract: To address the low accuracy of malware classification, this paper proposes a malware classification method based on N-Gram static analysis. First, the N-Gram method is used to extract byte sequences of length 2 from the malware samples. Second, the extracted features are used to train machine-learning classification models with KNN, logistic regression, random forest, and XGBoost. Third, the models are evaluated with the confusion matrix and the logarithmic loss function. Finally, the models are trained and tested on the Kaggle malware dataset. Experimental results show that the XGBoost and random forest models reach accuracies of 98.43% and 97.93%, with Log Loss values of 0.022240 and 0.026946, respectively. Compared with existing methods, extracting features with N-Gram classifies malware more accurately and helps protect computer systems from malware attacks.

Key words: N-Gram, Malware, Machine learning, Static analysis
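The two steps named in the abstract that are easiest to make concrete are the byte 2-gram extraction and the evaluation with a confusion matrix and logarithmic loss. The following is a minimal sketch of both, not the authors' implementation: it assumes each sample is available as a raw `bytes` object, that a fixed 2-gram vocabulary has already been chosen, and that class labels are integers starting at 0. All function names (`byte_bigrams`, `bigram_vector`, `confusion_matrix`, `log_loss`) are hypothetical and use only the Python standard library.

```python
from collections import Counter
import math


def byte_bigrams(data: bytes) -> Counter:
    """Count overlapping byte 2-grams (pairs of adjacent bytes) in a sample."""
    # Iterating over bytes yields ints, so each key is a tuple (b_i, b_{i+1}).
    return Counter(zip(data, data[1:]))


def bigram_vector(data: bytes, vocabulary):
    """Turn a sample into a fixed-length count vector over the given 2-grams.

    `vocabulary` is an assumed, pre-selected list of (int, int) byte pairs.
    """
    counts = byte_bigrams(data)
    return [counts.get(gram, 0) for gram in vocabulary]


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def log_loss(y_true, probabilities, eps=1e-15):
    """Multi-class logarithmic loss: mean of -log(p assigned to the true class)."""
    total = 0.0
    for true_label, row in zip(y_true, probabilities):
        # Clip so that a probability of exactly 0 or 1 does not produce log(0).
        p = min(max(row[true_label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


if __name__ == "__main__":
    sample = b"MZ\x90\x00MZ"  # toy bytes, not a real malware sample
    vocab = [(0x4D, 0x5A), (0xFF, 0xFF)]
    print(bigram_vector(sample, vocab))
    print(confusion_matrix([0, 1, 1], [0, 1, 0], n_classes=2))
    print(log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]]))
```

In a full pipeline, the count vectors would serve as input features for the classifiers listed in the abstract (KNN, logistic regression, random forest, XGBoost). How the paper selects its vocabulary and preprocesses the Kaggle files is not stated on this page, so those parts are left out of the sketch.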

CLC Number: TP309