计算机科学 ›› 2025, Vol. 52 ›› Issue (2): 261-267.doi: 10.11896/jsjkx.240200072
尚秋言1, 李奕聪2,3, 温瑞林1,2, 马银萍1,2, 欧阳荣彬1, 樊春1,2
SHANG Qiuyan1, LI Yicong2,3, WEN Ruilin1,2, MA Yinping1,2, OUYANG Rongbin1, FAN Chun1,2
摘要: 为了解决用户所提供的作业预计时间不准确对高性能计算平台调度系统的连锁影响,以鹤思调度系统为研究模板,提出了一种通用的两阶段多因素作业运行时间预测算法(TSMF)。TSMF融合了复杂的用户行为模式和作业上下文特征,以确保预测准确可靠,并能够无缝嵌入到大多数高性能计算平台的调度系统中,从而改善其性能。在北京大学高性能计算集群的数据集及真实调度系统上进行的多角度模拟实验显示,TSMF在预测准确性方面表现出色,能够在绝大部分作业上实现精准预测。例如,在多达60.8%的作业中,其预测误差在1min以内。此外,TSMF显著改进了实际情境中的调度算法,提高了资源利用率并大幅缩短了用户等待时间。
中图分类号:
[1]LU P J,XIONG Z Y,LAI M C.Analysis of the current status of high-performance computing technology and standards[J].Computer Science,2023,50(11):1-7. [2]ZRIGUI S,DE CAMARGO R Y,LEGRAND A,et al.Improving the performance of batch schedulers using online job run-time classification[J].Journal of Parallel and Distributed Computing,2022,164:83-95. [3]MENEAR K,NAG A,PERR-SAUER J,et al.Mastering HPC Runtime Prediction:From Observing Patterns to a Methodological Approach[M]//Practice and Experience in Advanced Research Computing.2023:75-85. [4]MARO R,BAI Y,BAHAR R I.Dynamically reconfiguring processor resources to reduce power consumption in high-perfor-mance processors[C]//Power-Aware Computer Systems:First International Workshop,PACS 2000 Cambridge,MA,USA,November 12,2000 Revised Papers 1.Springer Berlin Heidelberg,2001:97-111. [5]FAN Y,LAN Z,CHILDERS T,et al.Deep reinforcement agent for scheduling in HPC[C]//2021 IEEE International Parallel and Distributed Processing Symposium(IPDPS).IEEE,2021:807-816. [6]ZHANG D,DAI D,HE Y,et al.RLScheduler:an automatedHPC batch job scheduler using reinforcement learning[C]//SC20:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE,2020:1-15. [7]FAN Y,LAN Z.DRAS-CQSim:A reinforcement learning based framework for HPC cluster scheduling[J].Software Impacts,2021,8:100077. [8]TSAFRIR D,ETSION Y,FEITELSON D G.Backfilling usingsystem-generated predictions rather than user runtime estimates[J].IEEE Transactions on Parallel and Distributed Systems,2007,18(6):789-803. [9]PHINJAROENPHAN P,BEVINAKOPPA S,ZEEPHONG-SEKUL P.A method for estimating the execution time of a pa-rallel task on a grid node[C]//Advances in Grid Computing-EGC 2005:European Grid Conference,Amsterdam,The Netherlands,February 14-16,2005,Revised Selected Papers.Springer Berlin Heidelberg,2005:226-236. [10]QUINLAN J R.Induction of decision trees[J].Machine Lear-ning,1986,1:81-106. [11]SMOLA A J,SCHÖLKOPF B.A tutorial on support vector regression[J].Statistics and Computing,2004,14:199-222. [12]ROSENBLATT F.The perceptron:a probabilistic model for information storage and organization in the brain[J].Psychological Review,1958,65(6):386. [13]YU X X,WEI J W,ZHANG Z B,et al.Research on machine learning to predict job running time of school-level computing teaching platform[J].Soft Ware Guide,2023,22(11):104-109. [14]CHEN X,LU C D,PATTABIRAMAN K.Predicting job completion times using system logs in supercomputing clusters[C]//2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop(DSN-W).IEEE,2013:1-8. [15]BAUM L E,PETRIE T,SOULES G,et al.A maximizationtechnique occurring in the statistical analysis of probabilistic functions of Markov chains[J].The Annals of Mathematical Statistics,1970,41(1):164-171. [16]WU G P,SHEN Y,ZHANG W S,et al.Job duration prediction for backfill optimization[J].Journal of Chinese Computer System,2019,40(1):6-12. [17]ZHOU L F,YANG W Y,HAN Y G,et al.Job name hierarchical clustering algorithm predicts job running time[J].Journal of National University of Defense Technology/Guofang Keji Daxue Xuebao,2022,44(5):13-23. [18]ZHOU L,ZHANG X,YANG W,et al.Prep:Predicting jobruntime with job running path on supercomputers[C]//Proceedings of the 50th International Conference on Parallel Processing.2021:1-10. [19]CHEN T,GUESTRIN C.Xgboost:A scalable tree boosting system[C]//Proceedings of the 22nd ACM sigkdd International Conference on Knowledge Discovery and Data Mining.2016:785-794. [20]KE G,MENG Q,FINLEY T,et al.Lightgbm:A highly efficient gradient boosting decision tree[C]//Neural Information Processing Systems.Curran Associates Inc,2017. [21]FRIEDMAN J H.Greedy function approximation:a gradientboosting machine[J].Annals of Statistics,2001:1189-1232. [22]HU Q,SUN P,YAN S,et al.Characterization and prediction of deep learning workloads in large-scale gpu datacenters[C]//Proceedings of the International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.2021:1-15. [23]BREIMAN L.Random forests[J].Machine Learning,2001,45:5-32. [24]BOX G E P,JENKINS G M,REINSEL G C,et al.Time series analysis:forecasting and control[M].John Wiley & Sons,2015. |
|