Computer Science ›› 2025, Vol. 52 ›› Issue (12): 321-330. doi: 10.11896/jsjkx.250300056

• Information Security •

Malware Detection Based on API Sequence Feature Engineering and Feature Learning

YANG Yizhe, LU Tianliang, PENG Shufan, LI Xiaolin

  1. College of Information Network Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2025-03-11 Revised: 2025-06-04 Published: 2025-12-15 Online: 2025-12-09
  • Corresponding author: LU Tianliang (lutianliang@ppsuc.edu.cn)
  • About author: (694717399@qq.com) YANG Yizhe, born in 2001, postgraduate, is a member of CCF (No. Z0786G). His main research interests include malware detection.
    LU Tianliang, born in 1985, Ph.D., professor, Ph.D. supervisor. His main research interests include cyber security and artificial intelligence.
  • Supported by:
    This work was supported by the Science and Technology Program of the Ministry of Public Security (2023JSM09).

Abstract: Malware analysis methods based on API sequences can effectively capture the runtime behavior of programs. However, existing detection approaches typically consider only API names and ignore parameters and return values, or fail to fully exploit their semantic information and the correlations among parameters, which limits detection performance. To address this problem, this paper proposes a malware detection method that combines systematic feature engineering with a deep neural network architecture. The method first applies structured encoding to API sequences according to the data characteristics of API names, parameters, and return values. Multiple RefConv convolutional blocks then extract multi-scale features for each API call. Finally, the feature vectors are fed into a parallel recurrent neural network based on BiGRU-BiLSTM to learn long-term and short-term dependencies within the API sequence. For the experiments, a dataset of 25,000 API sequences was constructed and publicly released. In the comprehensive performance test, the proposed method achieves an accuracy of 93.55%. Temporal concept drift, spatial concept drift, and ablation experiments further verify that the proposed method can effectively detect malware.

Key words: Malware detection, API sequence, Feature engineering, RefConv, BiGRU, BiLSTM
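
The structured-encoding step described in the abstract can be illustrated with a small sketch. The abstract does not give the actual encoding, so everything below is an assumption made for illustration only: the function names `encode_call` and `encode_sequence`, the block sizes, the use of hash bucketing for names and parameters, the reading of a return value of 0 as success, and the sample calls are all hypothetical and are not the authors' method. The sketch only shows the general idea of turning each API call (name, parameters, return value) into a fixed-width row, so that a sequence becomes a matrix that convolutional and recurrent layers could consume.

```python
import hashlib
from typing import Dict, List, Tuple, Union

Value = Union[str, int]
Call = Tuple[str, Dict[str, Value], int]  # (API name, parameters, return value)


def _bucket(token: str, dim: int) -> int:
    """Map a token to a stable bucket index in [0, dim)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % dim


def encode_call(name: str, args: Dict[str, Value], ret: int,
                name_dim: int = 8, arg_dim: int = 16) -> List[float]:
    """Encode one call as [name block | parameter block | return block]."""
    # Name block: one-hot over hashed buckets (hypothetical choice).
    name_vec = [0.0] * name_dim
    name_vec[_bucket(name, name_dim)] = 1.0

    # Parameter block: hash "key=value" so the same value under different
    # parameter names is counted as a different token.
    arg_vec = [0.0] * arg_dim
    for key, value in args.items():
        arg_vec[_bucket(f"{key}={value}", arg_dim)] += 1.0

    # Return block: [success, failure]. Treating 0 as success is a
    # simplifying assumption, not a rule that holds for every API.
    ret_vec = [1.0, 0.0] if ret == 0 else [0.0, 1.0]
    return name_vec + arg_vec + ret_vec


def encode_sequence(calls: List[Call], max_len: int = 4,
                    name_dim: int = 8, arg_dim: int = 16) -> List[List[float]]:
    """Truncate or zero-pad a call sequence to max_len encoded rows."""
    width = name_dim + arg_dim + 2
    rows = [encode_call(n, a, r, name_dim, arg_dim) for n, a, r in calls[:max_len]]
    rows += [[0.0] * width for _ in range(max_len - len(rows))]
    return rows


# Hypothetical two-call trace, invented for this example.
calls: List[Call] = [
    ("NtCreateFile", {"filepath": "C:\\Users\\a\\x.exe", "access": 0x40000000}, 0),
    ("RegSetValueExA", {"regkey": "HKCU\\Software\\Run", "value": "x.exe"}, 0),
]
matrix = encode_sequence(calls, max_len=4)
print(len(matrix), len(matrix[0]))  # 4 26
```

Each row keeps the three parts of a call in separate blocks, which is one simple way to let later layers treat names, parameters, and return values differently. Hash collisions are possible with such small block sizes; a real system would choose the dimensions and the per-field encoding according to the data.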

CLC Number:

  • TP309