Computer Science, 2023, Vol. 50, Issue (6): 52-57. doi: 10.11896/jsjkx.230200159

• High Performance Computing •

Lattice QCD Calculation and Optimization on ARM Processors

SUN Wei1, BI Yujiang1,2, CHENG Yaodong1,2,3

  1 Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2 Tianfu Cosmic Ray Research Center, Chengdu 610213, China
  3 School of Nuclear Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-02-22 Revised: 2023-04-21 Online: 2023-06-15 Published: 2023-06-06
  • Corresponding author: SUN Wei (sunwei@ihep.ac.cn)
  • About author: SUN Wei, born in 1992, Ph.D. His main research interests include high performance computing, lattice quantum chromodynamics and quantum computing.
  • Supported by:
    National Natural Science Foundation of China (12205311, 11935017, 12075253, 12175063), China Postdoctoral Science Foundation (2022M713154) and Science and Technology Innovation Project of Institute of High Energy Physics (E15451U2).

Abstract: Lattice quantum chromodynamics (lattice QCD) is one of the most important applications of large-scale parallel computing in high energy physics. Research in this field usually consumes a large amount of computing resources, and its core is the solution of large-scale sparse linear systems. Based on the domestic Kunpeng 920 ARM processor, this paper studies Dslash, the hot spot of lattice QCD calculations, scales it to up to 64 nodes (6,144 cores), and demonstrates linear scalability. Using the roofline performance analysis model, we find that lattice QCD is a typical memory-bound application, and that compressing the 3×3 complex unitary matrices in Dslash according to their symmetry improves Dslash performance by about 22%. For solving large-scale sparse linear systems, we explore the commonly used Krylov subspace iterative algorithm BiCGStab and the recently developed state-of-the-art multigrid algorithm on the same ARM processor, and find that in practical physics calculations the multigrid algorithm is several times to an order of magnitude faster than BiCGStab, even when the multigrid setup time is included. Moreover, we consider the NEON vectorization instructions on the Kunpeng 920, which yield up to 20% improvement for the multigrid algorithm. Therefore, using the multigrid algorithm on ARM processors can greatly speed up practical physics research.

Key words: Lattice QCD, ARM architecture, Multigrid algorithm, Kunpeng 920, NEON vectorization

CLC Number:

  • TP391
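
The abstract above mentions compressing the 3×3 complex unitary matrices in Dslash by exploiting their symmetry. The abstract does not say which scheme the authors use, so the following is only a minimal sketch of one widely used option, assuming the matrices are in SU(3) (unitary with determinant 1): store the first two rows (12 real numbers instead of 18) and rebuild the third row as the complex conjugate of their cross product. All function names here (`sample_su3`, `compress`, `reconstruct`, `max_abs_diff`) are hypothetical and chosen for illustration.

```python
import cmath
import math


def matmul(x, y):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def sample_su3(t1, t2, alpha, beta):
    """Hypothetical test matrix: a product of a diagonal phase matrix and
    two real rotations. Each factor is unitary with determinant 1, so the
    product is an SU(3) matrix."""
    d = [[cmath.exp(1j * t1), 0, 0],
         [0, cmath.exp(1j * t2), 0],
         [0, 0, cmath.exp(-1j * (t1 + t2))]]
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    r01 = [[ca, sa, 0], [-sa, ca, 0], [0, 0, 1]]
    r12 = [[1, 0, 0], [0, cb, sb], [0, -sb, cb]]
    return matmul(matmul(r01, d), r12)


def compress(u):
    """Keep only the first two rows: 6 complex = 12 real numbers."""
    return [list(u[0]), list(u[1])]


def reconstruct(rows):
    """Rebuild the third row as conj(a x b), valid for SU(3) matrices."""
    a, b = rows
    c = [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]
    return [list(a), list(b), c]


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two 3x3 matrices."""
    return max(abs(x[i][j] - y[i][j]) for i in range(3) for j in range(3))


if __name__ == "__main__":
    u = sample_su3(0.3, -1.1, 0.7, 0.4)
    packed = compress(u)
    # The difference should be at floating-point round-off level.
    print(max_abs_diff(u, reconstruct(packed)))
```

The trade-off in such a scheme is a few extra floating-point operations per matrix in exchange for loading one third fewer matrix entries from memory, which is the kind of exchange that tends to pay off in a memory-bound kernel.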