Computer Science, 2022, Vol. 49, Issue (8): 336-343. doi: 10.11896/jsjkx.210900203
ZHANG Guang-hua 1,2, GAO Tian-jiao 1, CHEN Zhen-guo 3, YU Nai-wen 1
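Abstract: To address the low accuracy of malware classification, this paper proposes a malware classification method based on N-Gram static analysis. First, byte sequences of length 2 are extracted from malware samples using the N-Gram method. Second, machine-learning-based malware classification models are trained on the extracted features using KNN, logistic regression, random forest, and XGBoost. Third, the models are evaluated with confusion matrices and the logarithmic loss function. Finally, the models are trained and tested on the Kaggle malware dataset. Experimental results show that the XGBoost and random forest models reach accuracies of 98.43% and 97.93%, with Log Loss values of 0.022240 and 0.026946, respectively. Compared with existing methods, N-Gram feature extraction classifies malware more accurately and helps protect computer systems from malware attacks.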
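The feature-extraction and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It assumes each sample is already available as raw bytes. The function names (`byte_bigrams`, `bigram_vector`, `log_loss`, `confusion_matrix`) are hypothetical and chosen for illustration, and the relative-frequency normalization is an assumption that the abstract does not specify. The sketch does not reproduce the paper's classifiers or its results.

```python
import math
from collections import Counter


def byte_bigrams(data: bytes) -> Counter:
    """Count every overlapping byte sequence of length 2 in a sample."""
    return Counter(data[i:i + 2] for i in range(len(data) - 1))


def bigram_vector(data: bytes, vocabulary: list) -> list:
    """Project a sample onto a fixed bigram vocabulary.

    Assumption: features are relative frequencies; raw counts would
    work equally well for tree-based models.
    """
    counts = byte_bigrams(data)
    total = sum(counts.values()) or 1  # avoid division by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]


def log_loss(y_true: list, probs: list, eps: float = 1e-15) -> float:
    """Multi-class logarithmic loss: mean of -log p(true class)."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to keep log finite
        total -= math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true: list, y_pred: list, n_classes: int) -> list:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy sample: the byte pattern 55 8B EC repeated twice.
sample = bytes.fromhex("558bec558bec")
counts = byte_bigrams(sample)
vocab = sorted(counts)                 # vocabulary built from this one sample
features = bigram_vector(sample, vocab)

# Toy evaluation on two samples and two classes.
loss = log_loss([0, 1], [[1.0, 0.0], [0.5, 0.5]])
cm = confusion_matrix([0, 1, 1], [0, 1, 0], n_classes=2)
```

In practice the resulting feature vectors would be passed to a classifier such as a random forest or XGBoost; a byte-bigram vocabulary has at most 65,536 entries, so the feature space stays bounded regardless of sample size.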