Malware Detection Algorithm for Improving Active Learning

doi:10.11896/j.issn.1002-137X.2019.05.014

Abstract

Abstract: Traditional malware detection techniques rely on a large number of labeled samples. Newly emerging malware, however, usually has few labeled samples, so traditional machine learning detection methods struggle to achieve good results on it. To address this problem, this paper studies a malware detection algorithm based on improved active learning. It proposes a sample selection strategy based on Maximum Distance and a sample labeling strategy based on Minimum Risk Estimate, which together enable malware detection when only a small number of labeled samples are available. Experimental results show that the proposed algorithm achieves better overall detection performance than a method that does not use active learning. When labeled samples account for 10% of the data, it outperforms active learning with a random selection strategy, and it has better time performance than active learning with a manual labeling strategy.

Key words: Active learning, Estimated risk, Features, Malware, Sample
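
The abstract names a Maximum Distance sample selection strategy but does not define it on this page. The following is only a minimal sketch of what such a selection step might look like. It assumes that samples are fixed-length binary feature vectors and that "distance" means the Hamming distance from an unlabeled sample to its nearest labeled sample; both are illustrative assumptions, not the paper's stated definitions. The function names `hamming` and `select_max_distance` are hypothetical.

```python
from typing import Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def select_max_distance(
    labeled: Sequence[Sequence[int]],
    unlabeled: Sequence[Sequence[int]],
) -> int:
    """Return the index of the unlabeled sample farthest from the labeled set.

    Assumption (not from the paper): a sample's distance to the labeled set
    is its Hamming distance to the nearest labeled sample.
    """
    if not labeled or not unlabeled:
        raise ValueError("both sample sets must be non-empty")

    def distance_to_labeled(sample: Sequence[int]) -> int:
        return min(hamming(sample, known) for known in labeled)

    # Pick the unlabeled sample whose nearest labeled neighbour is farthest away.
    return max(
        range(len(unlabeled)),
        key=lambda i: distance_to_labeled(unlabeled[i]),
    )


# Toy data: 4-bit feature vectors (hypothetical, for illustration only).
labeled = [[0, 0, 0, 0], [1, 1, 0, 0]]
unlabeled = [[0, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
picked = select_max_distance(labeled, unlabeled)
print(picked)
```

In an active learning loop, the selected sample would then be labeled (in the paper, by the Minimum Risk Estimate strategy rather than by hand) and moved into the labeled set before the next round of selection.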

CLC number:

TP393.08

LI Yi-hong, LIU Fang-zheng, DU Zhen-yu. Malware Detection Algorithm for Improving Active Learning[J]. Computer Science, 2019, 46(5): 92-99. https://doi.org/10.11896/j.issn.1002-137X.2019.05.014
