Computer Science ›› 2017, Vol. 44 ›› Issue (10): 85-90. doi: 10.11896/j.issn.1002-137X.2017.10.016

• 2016 National Annual Conference on High Performance Computing •

Evaluation of Resource Management Methods for Large High Energy Physics Computer Cluster

SUN Zhen-yu (孙震宇), SHI Jing-yan (石京燕), JIANG Xiao-wei (姜晓巍), ZOU Jia-heng (邹佳恒) and DU Ran (杜然)

  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (11475210).

Abstract: High energy physics data consist of physics events that are independent of one another. A high energy physics computing task can therefore be parallelized by running a large number of jobs that process different data files simultaneously, which makes high energy physics computing a typical high-throughput computing scenario. The computing cluster at the Institute of High Energy Physics (IHEP) uses the open-source TORQUE/Maui for resource management and job scheduling. To ensure fairness, IHEP divides the cluster's resources into separate queues and limits the maximum number of running jobs per user; however, this also leads to very low overall resource utilization. SLURM and HTCondor are both open-source resource management systems that have become popular in recent years. SLURM offers a rich set of job scheduling policies, while HTCondor is well suited to high-throughput computing. Both can replace the aging and poorly maintained TORQUE/Maui and are feasible options for managing computing cluster resources. In this paper, the job submission behavior of Daya Bay experiment users was simulated on SLURM and HTCondor test clusters in order to test the resource allocation behavior and efficiency of the two systems. The scheduling results were then compared with the actual scheduling results of the same jobs on the IHEP TORQUE/Maui cluster. Finally, the strengths and weaknesses of SLURM and HTCondor were analyzed, and the feasibility of using SLURM or HTCondor to manage the IHEP computing cluster was discussed.

Key words: Resource management system, Job scheduler, Computer cluster, High throughput computing, High energy physics computing

