Computer Science, 2025, Vol. 52, Issue (5): 1-10. doi: 10.11896/jsjkx.241100177

• High Performance Computing •

AI+HPC: An Overview of Supercomputing System Software and Application Technology Development Driven by “AI+”

TAN Zhengyuan, ZHONG Jiaqing, CHEN Juan

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
  • Received: 2024-11-28 Revised: 2025-03-02 Online: 2025-05-15 Published: 2025-05-12
  • Corresponding author: CHEN Juan (juanchen@nudt.edu.cn)
  • About the authors: TAN Zhengyuan (tanzhengyuan@nudt.edu.cn), born in 2002, postgraduate. His main research interests include high performance computing.
    CHEN Juan, born in 1980, Ph.D., professor. Her main research interests include high performance computing, low-power compilers, and power management.
  • Supported by:
    Open Fund of National Key Laboratory of Parallel and Distributed Computing (PDL) (2023-KJWPDL-01).

Abstract: Artificial intelligence (AI) and high performance computing (HPC) are two essential technologies in computer science. With the rapid development of computing technology, the two have become increasingly intertwined, depending on and reinforcing each other. On the one hand, the new problems and challenges facing HPC systems call for AI methods to help solve them (AI for HPC). On the other hand, theoretical breakthroughs in AI rely on the powerful computing capability that HPC provides (HPC for AI). Against this background, the two fields are converging and developing in depth. This paper systematically reviews the development of AI and HPC technologies in recent years, focusing on three aspects: 1) the contribution of AI techniques to problems in HPC hardware architecture, operating system resource management, compilation optimization, and software development; 2) the support HPC provides for AI in hardware infrastructure and software applications; 3) the prospects and challenges for the future convergence of AI and HPC.

Key words: Artificial intelligence, High performance computing, Domain convergence, Hardware architecture, Software application

CLC Number: TP302