Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 231200165-6. doi: 10.11896/jsjkx.231200165

• Information Security •

  • Corresponding author: WANG Jia (jw1024@xju.edu.cn)
  • First author's email: 107552103703@stu.xju.edu.cn

Intelligent Penetration Path Based on Improved PPO Algorithm

WANG Ziyang, WANG Jia, XIONG Mingliang, WANG Wentao   

  1. School of Computer Science and Technology,Xinjiang University,Urumqi 830000,China
    Xinjiang Key Laboratory of Multilingual Information Technology,Urumqi 830000,China
  • Online:2024-11-16 Published:2024-11-13
  • About author:WANG Ziyang,born in 1996,postgraduate.His main research interests include reinforcement learning and cyberspace security.
    WANG Jia,born in 1987,Ph.D,associate professor,is a member of CCF(No.K8521M).Her main research interests include resource allocation in clouds,task scheduling in big data and cyberspace security.
  • Supported by:
    National Science and Technology Major Project(2022ZD0115803),Key Research and Development Program of Xinjiang Uygur Autonomous Region(2022B01008),National Natural Science Foundation of China(62363032),Natural Science Foundation of Xinjiang Uygur Autonomous Region(2023D01C20),Scientific Research Foundation of Higher Education(XJEDU2022P011) and “Heaven Lake Doctor” Project(202104120018).

Abstract: Penetration path planning is the first step of penetration testing and is of great significance for automating it. Existing studies on penetration path planning mostly model penetration testing as a fully observable, idealized process, which fails to reflect the partial observability of real penetration tests. Given the wide application of reinforcement learning in penetration testing, this paper models the penetration testing process as a partially observable Markov decision process (POMDP) to simulate real penetration testing more accurately. On this basis, because the fully connected layers that PPO uses to fit the policy and value functions cannot extract effective features from a partially observable space, an improved PPO algorithm named RPPO is proposed, in which both the policy network and the evaluation network combine fully connected layers with a long short-term memory (LSTM) structure to strengthen feature extraction in unknown environments. A new objective-function update method is also presented to improve the algorithm's robustness and convergence. Experimental results show that, across different network scenarios, RPPO converges in 21.21%, 28.64% and 22.85% fewer episodes than the existing A2C, PPO and NDSPI-DQN algorithms respectively, and obtains 66.01%, 58.61% and 132.64% higher cumulative reward, making it better suited to larger network environments with more than 50 hosts.

Key words: Penetration testing, Penetration path planning, Reinforcement learning, Proximal policy optimization, Long short-term memory network
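The abstract does not reproduce RPPO's modified objective-function update. As background, the standard PPO clipped surrogate objective that RPPO builds on can be sketched in a few lines of NumPy; the function name and toy values below are illustrative, not taken from the paper:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized):
    L = E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],
    where r_t = pi_new(a|s) / pi_old(a|s) and A_t is the advantage estimate."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the element-wise minimum makes the objective a pessimistic bound,
    # so large policy updates are not rewarded beyond the clipping range.
    return np.minimum(unclipped, clipped).mean()

# Toy probability ratios and advantage estimates for four transitions.
ratio = np.array([0.8, 1.0, 1.3, 1.5])
advantage = np.array([1.0, -0.5, 2.0, -1.0])
loss = ppo_clip_loss(ratio, advantage)  # → 0.3
```

The clipping keeps each policy update close to the previous policy, which is what PPO-family methods, including the variant described here, rely on for stable convergence.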

CLC number: TP309
[1]ARKIN B,STENDER S,MCGRAW G.Software Penetration Testing[J].IEEE Security & Privacy,2005,3(1):84-87.
[2]SARRAUTE C,RICHARTE G,LUCÁNGELI OBES J.An Algorithm to Find Optimal Attack Paths in Nondeterministic Scenarios[C]//Proc.of the 4th ACM Workshop on Security and Artificial Intelligence.Chicago,US,2011:71-80.
[3]SILVER D,HUANG A,MADDISON C J,et al.Mastering the Game of Go with Deep Neural Networks and Tree Search[J].Nature,2016,529(7587):484-489.
[4]WARRINGTON A,LAVINGTON J W,SCIBIOR A,et al.Robust Asymmetric Learning in POMDPs[C]//Proc.of the 38th International Conference on Machine Learning(PMLR).New York,US,2021:11013-11023.
[5]VAN OTTERLO M,WIERING M.Reinforcement Learning and Markov Decision Processes[M]//Reinforcement learning:State-of-the-art.Berlin,Heidelberg:Springer,2012:3-42.
[6]MCKINNEL D R,DARGAHI T,DEHGHANTANHA A,et al.A Systematic Literature Review and Meta-analysis on Artificial Intelligence in Penetration Testing and Vulnerability Assessment[J].Computers & Electrical Engineering,2019,75:175-188.
[7]MAEDA R,MIMURA M.Automating Post-exploitation with Deep Reinforcement Learning[J].Computers & Security,2021,100:102-108.
[8]LADOSZ P,BEN-IWHIWHU E,DICK J,et al.Deep Reinforcement Learning with Modulated Hebbian Plus Q-network Architecture[J].IEEE Transactions on Neural Networks and Learning Systems,2021,33(5):2045-2056.
[9]SCHWARTZ J,KURNIAWATI H.Autonomous Penetration Testing Using Reinforcement Learning[J].arXiv:1905.05965,2019.
[10]ZENNARO F M,ERDÖDI L.Modelling Penetration Testing with Reinforcement Learning Using Capture-the-flag Challenges:Trade-offs between Model-free Learning and A Priori Knowledge[J].IET Information Security,2023,17(3):441-457.
[11]ZHOU S,LIU J,HOU D,et al.Autonomous Penetration Testing Based on Improved Deep Q-network[J].Applied Sciences,2021,11(19):8823.
[12]ZHANG G M,ZHANG S Y,ZHANG J W.Attack Path Discovery and Optimization Method Based on PPO Algorithm[J].Information Network Security,2023,23(9):47-57.
[13]CHEN J,HU S,ZHENG H,et al.GAIL-PT:An Intelligent Penetration Testing Framework with Generative Adversarial Imitation Learning[J].Computers & Security,2023,126:103055.
[14]ZHOU T,ZANG Y,ZHU J,et al.NIG-AP:A New Method for Automated Penetration Testing[J].Frontiers of Information Technology & Electronic Engineering,2019,20(9):1277-1288.
[15]CODY T.A Layered Reference Model for Penetration Testing with Reinforcement Learning and Attack Graphs[C]//Proc.of 2022 IEEE 29th Annual Software Technology Conference(STC).Gaithersburg,MD,USA,IEEE,2022:41-50.
[16]NGUYEN H V,TEERAKANOK S,INOMATA A,et al.The Proposal of Double Agent Architecture using Actor-critic Algorithm for Penetration Testing[C]//ICISSP.2021:440-449.
[17]SARRAUTE C,BUFFET O,HOFFMANN J.POMDPs Make Better Hackers:Accounting for Uncertainty in Penetration Testing[C]//Proc.of the 26th AAAI Conference on Artificial Intelligence.Toronto,Ontario,Canada,2012:1816-1824.
[18]ZHANG Y,LIU J,ZHOU S,et al.Improved Deep Recurrent Q-Network of POMDPs for Automated Penetration Testing[J].Applied Sciences,2022,12(20):10339.
[19]GHANEM M C,CHEN T M.Reinforcement Learning for Efficient Network Penetration Testing[J].Information,2019,11(1):6.
[20]GHANEM M C,CHEN T M,NEPOMUCENO E G.Hierarchical Reinforcement Learning for Efficient and Effective Automated Penetration Testing of Large Networks[J].Journal of Intelligent Information Systems,2023,60(2):281-303.
[21]KORONIOTIS N,MOUSTAFA N,TURNBULL B,et al.A Deep Learning-based Penetration Testing Framework for Vulnerability Identification in Internet of Things Environments[C]//2021 IEEE 20th International Conference on Trust,Security and Privacy in Computing and Communications(TrustCom).IEEE,2021:887-894.
[22]SCHWARTZ J,KURNIAWATI H,EL-MAHASSNI E.POMDP+Information-decay:Incorporating Defender's Behaviour in Autonomous Penetration Testing[C]//Proceedings of the International Conference on Automated Planning and Scheduling.2020,30:235-243.
[23]ZHOU S,LIU J,HOU D,et al.Autonomous Penetration Testing Based on Improved Deep Q-network[J].Applied Sciences,2021,11(19):8823.