Evaluation of Resource Management Methods for Large High Energy Physics Computer Cluster

doi:10.11896/j.issn.1002-137X.2017.10.016

Abstract

Abstract: High energy physics data consist of many events that are independent of one another. A high energy physics computing task is therefore parallelized by running multiple jobs simultaneously, each processing a different data file, which makes high energy physics computing a typical high throughput computing scenario. The computer cluster at the Institute of High Energy Physics (IHEP) uses the open-source TORQUE/Maui for resource management and job scheduling. IHEP enforces a fair-use policy by dividing the computing resources of this cluster into multiple queues and limiting the maximum number of running jobs per user. However, this leads to low overall resource utilization of the cluster. SLURM and HTCondor are both popular open-source resource management systems. SLURM offers a wide range of job scheduling policies, while HTCondor is well suited to high throughput computing. Both are possible resource management solutions for computer clusters, replacing the aging TORQUE/Maui, which lacks support. In this paper, the job submission behavior of users from the Daya Bay experiment was simulated on SLURM and HTCondor test clusters in order to test the resource allocation behavior and efficiency of the two systems. Their scheduling results were then compared with the actual scheduling results of the same jobs on the IHEP TORQUE/Maui cluster. Finally, the strengths and weaknesses of SLURM and HTCondor were analyzed, and the practicability of using SLURM or HTCondor to manage the IHEP computer cluster was discussed.

Key words: Resource management system, Job scheduler, Computer cluster, High throughput computing, High energy physics computing

SUN Zhen-yu, SHI Jing-yan, JIANG Xiao-wei, ZOU Jia-heng and DU Ran. Evaluation of Resource Management Methods for Large High Energy Physics Computer Cluster [J]. Computer Science, 2017, 44(10): 85-90.
