Computer Science, 2025, Vol. 52, Issue (11): 40-48. doi: 10.11896/jsjkx.241100118

• Research and Application of Large Language Model Technology •

Instruct-Malware: Control Flow Graph Based Large Language Model Analysis of Malware

ZHOU Yuchen1, LI Peng1,2, HAN Keji1,2   

  1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2 Institute of Network Security and Trusted Computing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2024-11-20 Revised: 2025-03-18 Online: 2025-11-15 Published: 2025-11-06
  • About author: ZHOU Yuchen, born in 2000, postgraduate. His main research interests include malicious code analysis and multimodal large language models.
    LI Peng, born in 1979, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.48573M). His main research interests include computer communication networks, cloud computing and information security.
  • Supported by:
    Six Talent Peaks Project of Jiangsu Province (RJFW-111) and Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX24_1227, KYCX23_1048).

Abstract: Malware detection and classification are challenging because malware is complex and stealthy. Although graph neural networks (GNNs) can effectively model control flow graphs and thereby improve the accuracy of behavioral pattern recognition, their “black-box” nature limits interpretability. Moreover, existing methods rely heavily on large amounts of labeled data, which weakens generalization and makes novel malware difficult to handle. Large language models (LLMs) offer strong feature extraction and contextual understanding; they can process few-shot data efficiently and fuse multimodal information, improving analytical precision and generalizability. Inspired by LLMs and leveraging contrastive learning, this paper jointly learns the structure of control flow graphs and their assembly instructions to make malware analysis more effective and flexible. On this basis, it designs the Instruct-Malware framework, which employs a lightweight graph-text alignment projection and two-stage instruction optimization to improve the flexibility and robustness of malware analysis. The framework also improves model interpretability by clarifying the decision-making process. Experimental results show that it achieves significant performance improvements in malware classification and subgraph recognition tasks, surpassing current mainstream approaches and substantially narrowing the gap with specialized models. These results provide new insights into building efficient and reliable malware analysis systems.

Key words: Malware analysis, Graph neural networks, Large language models, Contrastive learning

CLC Number: 

  • TP319
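
The abstract mentions contrastive learning for aligning control flow graph structure with assembly instruction text but does not give the objective. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive loss over paired graph and text embeddings. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_alignment_loss(graph_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of (graph, text) is the positive,
    every other pairing in the batch is a negative. Hypothetical helper."""
    g = [l2_normalize(v) for v in graph_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every graph and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t]
              for gi in g]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-graph direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: two graph embeddings and two text embeddings (made-up values).
graphs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(graphs, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_alignment_loss(graphs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy batch the loss is lower when each graph embedding points in the same direction as its paired text embedding than when the pairs are swapped, which is the behaviour a graph-text alignment stage would be trained toward.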