Computer Science ›› 2026, Vol. 53 ›› Issue (4): 415-423. doi: 10.11896/jsjkx.250900139

• Information Security •

Network Traffic Generation Method for Malicious Traffic Identification

ZHANG Can1, LI Weixun2, WANG Ming3, ZHAN Xiong3, XIE Ziguang4, HAN Dongqi1, WANG Zhiliang1, YANG Jiahai1   

  • 1 Institute of Network Sciences and Network Space, Tsinghua University, Beijing 100084, China
    2 State Grid Hebei Electric Power Company, Shijiazhuang 050031, China
    3 State Grid Corporation of China, Beijing 100031, China
    4 State Grid Shaanxi Electric Power Company, Xi’an 710048, China
  • Received: 2025-09-23 Revised: 2025-12-18 Online: 2026-04-15 Published: 2026-04-08
  • About author: ZHANG Can, born in 2004, Ph.D. candidate. Her main research interest is cybersecurity.
    HAN Dongqi, born in 1997, Ph.D., associate professor/researcher, master's supervisor, is a member of CCF (No.93935M). His main research interests include network and artificial intelligence security.
  • Supported by:
    Science and Technology Project of State Grid Corporation of China(5108-202413050A-1-1-ZN).

Abstract: Malicious traffic identification is a key task in cybersecurity, and the quality of training data directly determines the accuracy of detection models. However, obtaining real traffic data is challenging due to privacy concerns, high annotation costs, and class imbalance. To address these challenges, this paper proposes a fine-grained network traffic generation method based on a pre-training and fine-tuning paradigm. The method first introduces a static tokenization scheme that preserves protocol structure information, converting raw traffic into sequence representations that maintain protocol semantics and are suitable for autoregressive model learning. On this basis, a two-stage generation framework is constructed: the model is pre-trained on large-scale benign traffic to capture general protocol and temporal patterns, then fine-tuned on task-specific labeled malicious traffic to generate high-fidelity samples with explicit attack semantics. Multi-dimensional experiments are conducted to evaluate the effectiveness of the proposed method. The results show that it outperforms mainstream baselines in protocol compliance (a 99.95% pass rate in expert knowledge checks), distribution similarity (an Earth Mover's Distance of 0.0059 between generated and real distributions), and generation diversity (real neighborhood coverage exceeding 50%). In malicious traffic identification tasks, the traffic generated by the proposed method, unlike that of the baseline methods, improves the detection performance of multiple classifiers. In addition, malicious functionality verification experiments confirm that the generated traffic successfully reproduces attack effects in two attack scenarios. Overall, the results demonstrate that the proposed method can generate fine-grained malicious traffic that is syntactically compliant, statistically consistent, and semantically functional, providing an effective technical approach to alleviating the data scarcity problem in cybersecurity.
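
For readers unfamiliar with the distribution-similarity metric named above, the following is a minimal sketch of a one-dimensional Earth Mover's Distance (1-Wasserstein distance) between a real and a generated sample. The abstract does not state which traffic features are compared or how they are normalized, so the function name `emd_1d` and the normalized packet-length values in the usage lines are illustrative assumptions, not the authors' evaluation code.

```python
from bisect import bisect_right


def emd_1d(real, generated):
    """Earth Mover's Distance between two 1-D empirical samples.

    Computed as the area between the two empirical CDFs.
    Illustrative helper (hypothetical name), not the paper's implementation.
    """
    if not real or not generated:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(generated)
    # Both CDFs are step functions that only change at observed values.
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)  # share of real values <= left
        cdf_b = bisect_right(b, left) / len(b)  # share of generated values <= left
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Hypothetical packet lengths scaled to [0, 1]; the values are made up.
real_lengths = [0.1, 0.2, 0.4]
generated_lengths = [0.1, 0.3, 0.4]
print(emd_1d(real_lengths, generated_lengths))
```

A value near zero means the two empirical distributions nearly coincide. The result depends on the scale of the feature, so values are only comparable when both samples use the same normalization.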

Key words: Network traffic generation, Malicious traffic identification, Generative AI, Autoregressive model, Pre-training and fine-tuning

CLC Number: TP311