AI+HPC: An Overview of Supercomputing System Software and Application Technology Development Driven by “AI+”

doi:10.11896/jsjkx.241100177

Abstract

Artificial intelligence (AI) and high-performance computing (HPC) are two essential technologies in computer science. As computer science and technology have advanced rapidly, the two fields have gradually converged. On the one hand, new challenges in high-performance computing systems require AI-powered solutions (AI for HPC). On the other hand, breakthroughs in artificial intelligence demand the support of high-performance computing (HPC for AI). Consequently, the convergence of AI and HPC drives the development of core technologies in both fields. This paper systematically reviews technological developments in AI and HPC over the past decade, focusing on three aspects: 1) the role of AI technology in HPC hardware architecture, operating system resource management, compilation optimization, and software development; 2) the support of HPC for AI in terms of system hardware solutions and software applications; 3) prospects and challenges for the future convergence of AI and HPC.

Key words: Artificial intelligence, High performance computing, Domain convergence, Hardware architecture, Software application

CLC Number: TP302

TAN Zhengyuan, ZHONG Jiaqing, CHEN Juan. AI+HPC: An Overview of Supercomputing System Software and Application Technology Development Driven by “AI+” [J]. Computer Science, 2025, 52(5): 1-10.
