Computer Science (计算机科学), 2017, Vol. 44, Issue 10: 85-90. doi: 10.11896/j.issn.1002-137X.2017.10.016
孙震宇,石京燕,姜晓巍,邹佳恒,杜然
SUN Zhen-yu, SHI Jing-yan, JIANG Xiao-wei, ZOU Jia-heng and DU Ran
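Abstract: High-energy physics data consist of physics events that are independent of one another. A computing task can therefore be parallelized by running many jobs at once, each processing different data files, which makes high-energy physics computing a typical high-throughput computing scenario. The computing cluster at the Institute of High Energy Physics (IHEP) uses the open-source TORQUE/Maui for resource management and job scheduling, and ensures fairness by dividing cluster resources into separate queues and capping the number of jobs each user may run concurrently; however, this also leads to very low overall resource utilization. SLURM and HTCondor are both open-source resource management systems that have become popular in recent years: the former offers a rich set of job scheduling policies, and the latter is well suited to high-throughput computing. Both can replace the aging, poorly maintained TORQUE/Maui and are feasible options for managing cluster resources. The job submission behavior of Daya Bay experiment users was simulated on SLURM and HTCondor test clusters to test the resource allocation behavior and efficiency of the two systems. The results were compared with the actual scheduling results of the same jobs on the IHEP TORQUE/Maui cluster, the strengths and weaknesses of SLURM and HTCondor were analyzed, and the feasibility of managing the IHEP computing cluster with either system was discussed.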
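The link between static queue partitioning and low utilization can be illustrated with a toy calculation. The sketch below is not the paper's test method: the queue names, slot counts, and job demands are hypothetical values chosen only to show how slots left idle in one queue cannot absorb jobs waiting in another, whereas a shared pool can.

```python
def running_jobs_partitioned(queue_slots, queue_demand):
    """Jobs that can run when each queue may use only its own slots."""
    return sum(min(slots, queue_demand.get(name, 0))
               for name, slots in queue_slots.items())


def running_jobs_shared(queue_slots, queue_demand):
    """Jobs that can run when all slots form one shared pool."""
    return min(sum(queue_slots.values()), sum(queue_demand.values()))


def utilization(running, queue_slots):
    """Fraction of all slots that are busy."""
    return running / sum(queue_slots.values())


# Hypothetical cluster: three queues of 100 slots each (assumed numbers).
slots = {"dyb": 100, "bes": 100, "juno": 100}
# Hypothetical demand: one experiment is busy, the others are nearly idle.
demand = {"dyb": 250, "bes": 20, "juno": 0}

partitioned = utilization(running_jobs_partitioned(slots, demand), slots)
shared = utilization(running_jobs_shared(slots, demand), slots)
print(f"partitioned: {partitioned:.0%}, shared pool: {shared:.0%}")
```

With these assumed numbers, the partitioned layout leaves most slots idle while jobs still wait in the busy queue. A per-user cap on running jobs has a similar effect whenever only a few users are active.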