Computer Science ›› 2025, Vol. 52 ›› Issue (5): 1-10. doi: 10.11896/jsjkx.241100177
谭政源, 钟佳卿, 陈娟
TAN Zhengyuan, ZHONG Jiaqing, CHEN Juan
Abstract: Artificial intelligence (AI) and high-performance computing (HPC) are two major technologies in the field of computing. With the rapid development of computer technology, the two have become increasingly intertwined and now exhibit a relationship of mutual dependence and mutual reinforcement. On the one hand, the new problems and challenges facing high-performance computing systems call for AI methods and techniques to help address them (AI for HPC); on the other hand, theoretical breakthroughs in artificial intelligence rely on the powerful computing capacity provided by HPC (HPC for AI). Against this background, the two fields are converging and developing in depth. This paper systematically reviews the recent development of technologies in both the AI and HPC fields, with the analysis focusing on the following aspects: 1) the contributions of AI techniques to solving problems in HPC hardware architecture, operating-system resource management, compiler optimization, and software development; 2) the support HPC provides for AI in terms of hardware infrastructure and software applications; and 3) the future prospects and challenges of the convergence of AI and HPC.
CLC Number:
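To make the "AI for HPC" direction more concrete, the sketch below shows a minimal search-based compiler flag autotuner, the kind of search loop that the learning-based autotuning work surveyed in the paper builds on. It is an illustrative assumption, not code from the paper: the gcc invocation, the kernel.c benchmark, and the flag pool are hypothetical placeholders for a real compile-and-measure harness on a POSIX system.

import random
import subprocess
import time

# Hypothetical flag pool; a real autotuner would draw from the full
# optimization space of the target compiler.
FLAGS = ["-O3", "-funroll-loops", "-ffast-math", "-march=native"]

def benchmark(flags, source="kernel.c"):
    # Assumed cost function: compile `source` with the candidate flags and
    # time the resulting binary. Requires gcc and a kernel.c benchmark.
    subprocess.run(["gcc", *flags, source, "-o", "kernel"], check=True)
    start = time.perf_counter()
    subprocess.run(["./kernel"], check=True)
    return time.perf_counter() - start

def random_search(trials=20):
    # Plain random search over flag subsets; Bayesian or RL-based tuners
    # replace this sampling step with a learned model of the search space.
    best_flags, best_time = ["-O2"], float("inf")
    for _ in range(trials):
        candidate = [f for f in FLAGS if random.random() < 0.5] or ["-O2"]
        elapsed = benchmark(candidate)
        if elapsed < best_time:
            best_flags, best_time = candidate, elapsed
    return best_flags, best_time

if __name__ == "__main__":
    flags, seconds = random_search()
    print(f"best flags: {flags}, runtime: {seconds:.3f}s")

Bayesian-network and reinforcement-learning autotuners differ mainly in how the candidate set in the loop above is proposed; the compile-and-measure cost function stays the same.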