Computer Science, 2026, Vol. 53, Issue (4): 415-423. doi: 10.11896/jsjkx.250900139

• Information Security •

Network Traffic Generation Method for Malicious Traffic Identification

ZHANG Can1, LI Weixun2, WANG Ming3, ZHAN Xiong3, XIE Ziguang4, HAN Dongqi1, WANG Zhiliang1, YANG Jiahai1

  1 Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing 100084, China
  2 State Grid Hebei Electric Power Company, Shijiazhuang 050031, China
  3 State Grid Corporation of China, Beijing 100031, China
  4 State Grid Shaanxi Electric Power Company, Xi’an 710048, China
  • Received: 2025-09-23  Revised: 2025-12-18  Published: 2026-04-15  Online: 2026-04-08
  • Corresponding author: HAN Dongqi (handongqi@bupt.edu.cn)
  • About author: ZHANG Can (zhangcan25@mails.tsinghua.edu.cn), born in 2004, Ph.D. candidate. Her main research interest is cybersecurity.
    HAN Dongqi, born in 1997, Ph.D., associate professor/researcher, master’s supervisor, is a member of CCF (No.93935M). His main research interests include network and artificial intelligence security.
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (5108-202413050A-1-1-ZN).

Abstract: Malicious traffic identification is a key task in cybersecurity, and the quality of training data directly determines the accuracy of detection models. However, real traffic data is hard to obtain because of privacy concerns, high annotation costs, and class imbalance. To address these challenges, this paper proposes a fine-grained network traffic generation method based on a pre-training and fine-tuning paradigm. The method first introduces a static tokenization scheme that preserves protocol structure information, converting raw traffic into sequence representations that maintain protocol semantics and are suitable for autoregressive model learning. On this basis, a two-stage generation framework is constructed: the model is pre-trained on large-scale benign traffic to capture general protocol and temporal patterns, then fine-tuned on labeled malicious traffic to generate high-fidelity samples with explicit attack semantics. To evaluate the effectiveness of the proposed method, multi-dimensional experiments are conducted. The results show that the method outperforms mainstream baselines in protocol compliance (a 99.95% pass rate in domain expert knowledge checks), distribution similarity (an Earth Mover’s Distance of only 0.0059 between generated and real distributions), and generation diversity (real neighborhood coverage exceeding 50%). In malicious traffic identification tasks trained on generated traffic, the proposed method is the only one, compared with the baseline methods, that improves the detection performance of multiple classifiers. In addition, malicious functionality verification experiments confirm the attack effect of the generated traffic in two attack scenarios. Overall, the results demonstrate that the proposed method can generate fine-grained malicious traffic that is syntactically compliant, statistically similar, and semantically functional, providing an effective technical approach to alleviating the scarcity of traffic data in cybersecurity.

Key words: Network traffic generation, Malicious traffic identification, Generative AI, Autoregressive model, Pre-training and fine-tuning
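
The abstract reports distribution similarity as an Earth Mover’s Distance between generated and real distributions. As an illustration of that metric only, the following minimal sketch computes the distance between two one-dimensional samples by integrating the absolute difference of their empirical CDFs. The function name `emd_1d` and the sample values are hypothetical, and the abstract does not state which traffic features or normalization the reported figure uses, so this is not the authors' evaluation code.

```python
from bisect import bisect_right


def emd_1d(real, generated):
    """Earth Mover's Distance between two 1-D empirical distributions.

    Illustrative helper (hypothetical name): integrates the absolute
    difference between the two empirical CDFs over the value axis.
    """
    if not real or not generated:
        raise ValueError("both samples must be non-empty")
    xs = sorted(real)
    ys = sorted(generated)
    # Every point where either empirical CDF can jump.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right).
        cdf_real = bisect_right(xs, left) / len(xs)
        cdf_gen = bisect_right(ys, left) / len(ys)
        total += abs(cdf_real - cdf_gen) * (right - left)
    return total


# Made-up, normalized feature values (assumption, not data from the paper).
real_feature = [0.10, 0.20, 0.20, 0.90]
generated_feature = [0.10, 0.25, 0.20, 0.85]
print(emd_1d(real_feature, generated_feature))
```

A value near zero means the generated sample places its mass almost where the real sample does. For larger workloads, a library routine such as `scipy.stats.wasserstein_distance` computes the same one-dimensional quantity.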

CLC Number: TP311