计算机科学 ›› 2025, Vol. 52 ›› Issue (2): 261-267.doi: 10.11896/jsjkx.240200072

• 计算机网络 • 上一篇    下一篇

基于使用特性的两阶段多因素作业运行时间预测算法

尚秋言1, 李奕聪2,3, 温瑞林1,2, 马银萍1,2, 欧阳荣彬1, 樊春1,2   

  1. 1 北京大学计算中心 北京 100084
    2 北京大学长沙计算与数字经济研究院 长沙 410000
    3 电子科技大学计算机科学与工程学院 成都 610000
  • 收稿日期:2024-02-21 修回日期:2024-04-04 出版日期:2025-02-15 发布日期:2025-02-17
  • 通讯作者: 马银萍(mayinping@pku.edu.cn)
  • 作者简介:(shangqiuyan@stu.pku.edu.cn)
  • 基金资助:
    湖南省创新型省份建设专项资金(2023GK1010);北京大学高性能计算平台计算资源支持

Two-stage Multi-factor Algorithm for Job Runtime Prediction Based on Usage Characteristics

SHANG Qiuyan1, LI Yicong2,3, WEN Ruilin1,2, MA Yinping1,2, OUYANG Rongbin1, FAN Chun1,2   

  1. 1 Computer Center,Peking University,Beijing 100084,China
    2 Peking University Changsha Institute for Computing and Digital Economy,Changsha 410000,China
    3 College of Computer Science and Engineering,University of Electronic Science and Technology of China,Chengdu 610000,China
  • Received:2024-02-21 Revised:2024-04-04 Online:2025-02-15 Published:2025-02-17
  • About author:SHANG Qiuyan,born in 1999,postgra-duate.Her main research interests include high performance computing,scheduling algorithm and artificial intelligence.
    MA Yinping,born in 1993,postgra-duate,engineer,is a member of CCF(No.L2132M).Her main research interests include high performance computing and artificial intelligence.
  • Supported by:
    Special Funds for Construction of Innovative Provinces in Hunan Province(2023GK1010) and High-performance Computing Platform of Peking University for Providing Computational Resources.

摘要: 为了解决用户所提供的作业预计时间不准确对高性能计算平台调度系统的连锁影响,以鹤思调度系统为研究模板,提出了一种通用的两阶段多因素作业运行时间预测算法(TSMF)。TSMF融合了复杂的用户行为模式和作业上下文特征,以确保预测准确可靠,并能够无缝嵌入到大多数高性能计算平台的调度系统中,从而改善其性能。在北京大学高性能计算集群的数据集及真实调度系统上进行的多角度模拟实验显示,TSMF在预测准确性方面表现出色,能够在绝大部分作业上实现精准预测。例如,在多达60.8%的作业中,其预测误差在1min以内。此外,TSMF显著改进了实际情境中的调度算法,提高了资源利用率并大幅缩短了用户等待时间。

关键词: 高性能计算, 作业运行时间预测, 作业调度, 行为模式, 机器学习

Abstract: To address the chain impact of inaccurate user-set job runtime on the scheduling system of high-performance computing platforms,a versatile two-stage multi-factor(TSMF) algorithm for job runtime prediction is proposed.TSMF integrates intricate user behavior patterns and nuanced job contextual features to ensure accurate and reliable predictions.TSMF can seamlessly embed into the scheduling systems of most high-performance computing platforms,thereby enhancing their performance.The multi-angle simulation experiments on the dataset and real scheduling system of Peking University's high-performance computing clusters show that TSMF performs well in prediction accuracy and can achieve accurate prediction on most jobs.For example,in up to 60.8 % of jobs,the prediction error is as low as less than one minute.Furthermore,TSMF significantly enhances the sche-duling algorithms in practical scenarios,improving resource utilization and substantially reducing user waiting times.

Key words: High-performance computing, Job runtime prediction, Job scheduling, Behavior patterns, Machine learning

中图分类号: 

  • TP391
[1]LU P J,XIONG Z Y,LAI M C.Analysis of the current status of high-performance computing technology and standards[J].Computer Science,2023,50(11):1-7.
[2]ZRIGUI S,DE CAMARGO R Y,LEGRAND A,et al.Improving the performance of batch schedulers using online job run-time classification[J].Journal of Parallel and Distributed Computing,2022,164:83-95.
[3]MENEAR K,NAG A,PERR-SAUER J,et al.Mastering HPC Runtime Prediction:From Observing Patterns to a Methodological Approach[M]//Practice and Experience in Advanced Research Computing.2023:75-85.
[4]MARO R,BAI Y,BAHAR R I.Dynamically reconfiguring processor resources to reduce power consumption in high-perfor-mance processors[C]//Power-Aware Computer Systems:First International Workshop,PACS 2000 Cambridge,MA,USA,November 12,2000 Revised Papers 1.Springer Berlin Heidelberg,2001:97-111.
[5]FAN Y,LAN Z,CHILDERS T,et al.Deep reinforcement agent for scheduling in HPC[C]//2021 IEEE International Parallel and Distributed Processing Symposium(IPDPS).IEEE,2021:807-816.
[6]ZHANG D,DAI D,HE Y,et al.RLScheduler:an automatedHPC batch job scheduler using reinforcement learning[C]//SC20:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE,2020:1-15.
[7]FAN Y,LAN Z.DRAS-CQSim:A reinforcement learning based framework for HPC cluster scheduling[J].Software Impacts,2021,8:100077.
[8]TSAFRIR D,ETSION Y,FEITELSON D G.Backfilling usingsystem-generated predictions rather than user runtime estimates[J].IEEE Transactions on Parallel and Distributed Systems,2007,18(6):789-803.
[9]PHINJAROENPHAN P,BEVINAKOPPA S,ZEEPHONG-SEKUL P.A method for estimating the execution time of a pa-rallel task on a grid node[C]//Advances in Grid Computing-EGC 2005:European Grid Conference,Amsterdam,The Netherlands,February 14-16,2005,Revised Selected Papers.Springer Berlin Heidelberg,2005:226-236.
[10]QUINLAN J R.Induction of decision trees[J].Machine Lear-ning,1986,1:81-106.
[11]SMOLA A J,SCHÖLKOPF B.A tutorial on support vector regression[J].Statistics and Computing,2004,14:199-222.
[12]ROSENBLATT F.The perceptron:a probabilistic model for information storage and organization in the brain[J].Psychological Review,1958,65(6):386.
[13]YU X X,WEI J W,ZHANG Z B,et al.Research on machine learning to predict job running time of school-level computing teaching platform[J].Soft Ware Guide,2023,22(11):104-109.
[14]CHEN X,LU C D,PATTABIRAMAN K.Predicting job completion times using system logs in supercomputing clusters[C]//2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop(DSN-W).IEEE,2013:1-8.
[15]BAUM L E,PETRIE T,SOULES G,et al.A maximizationtechnique occurring in the statistical analysis of probabilistic functions of Markov chains[J].The Annals of Mathematical Statistics,1970,41(1):164-171.
[16]WU G P,SHEN Y,ZHANG W S,et al.Job duration prediction for backfill optimization[J].Journal of Chinese Computer System,2019,40(1):6-12.
[17]ZHOU L F,YANG W Y,HAN Y G,et al.Job name hierarchical clustering algorithm predicts job running time[J].Journal of National University of Defense Technology/Guofang Keji Daxue Xuebao,2022,44(5):13-23.
[18]ZHOU L,ZHANG X,YANG W,et al.Prep:Predicting jobruntime with job running path on supercomputers[C]//Proceedings of the 50th International Conference on Parallel Processing.2021:1-10.
[19]CHEN T,GUESTRIN C.Xgboost:A scalable tree boosting system[C]//Proceedings of the 22nd ACM sigkdd International Conference on Knowledge Discovery and Data Mining.2016:785-794.
[20]KE G,MENG Q,FINLEY T,et al.Lightgbm:A highly efficient gradient boosting decision tree[C]//Neural Information Processing Systems.Curran Associates Inc,2017.
[21]FRIEDMAN J H.Greedy function approximation:a gradientboosting machine[J].Annals of Statistics,2001:1189-1232.
[22]HU Q,SUN P,YAN S,et al.Characterization and prediction of deep learning workloads in large-scale gpu datacenters[C]//Proceedings of the International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.2021:1-15.
[23]BREIMAN L.Random forests[J].Machine Learning,2001,45:5-32.
[24]BOX G E P,JENKINS G M,REINSEL G C,et al.Time series analysis:forecasting and control[M].John Wiley & Sons,2015.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!