Computer Science (计算机科学), 2023, Vol. 50, Issue (6): 52-57. doi: 10.11896/jsjkx.230200159
孙玮1, 毕玉江1,2, 程耀东1,2,3
SUN Wei1, BI Yujiang1,2, CHENG Yaodong1,2,3
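Abstract: Lattice quantum chromodynamics (lattice QCD) is one of the most important applications in high-energy physics that require large-scale parallel computing. The related research typically consumes large amounts of computing resources, and its core is the solution of large sparse linear systems. Based on the domestically developed Kunpeng 920 ARM processor, this paper studies Dslash, the computational hotspot of lattice QCD, scales it to 64 nodes (6,144 cores), and demonstrates the linear scalability of lattice QCD computation. Using the roofline performance model, it finds that lattice QCD is a typical memory-bound application, and that compressing the 3×3 complex unitary matrices in Dslash according to their symmetry improves Dslash performance by about 22%. For solving the large sparse linear systems, the paper explores on the ARM processor both the commonly used Krylov subspace iterative algorithm BiCGStab and the state-of-the-art multigrid algorithm developed in recent years. Even when the setup (preprocessing) time is included, multigrid still gives a speedup of several times to an order of magnitude over BiCGStab in real physics calculations. In addition, the NEON vectorization instructions on the Kunpeng 920 processor are considered, and they bring a speedup of about 20% when used in multigrid calculations. Therefore, using the multigrid algorithm on ARM processors can greatly accelerate real physics research.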
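The abstract does not say which compression scheme is applied to the 3×3 matrices. As an illustration only, the sketch below assumes the matrices are SU(3) gauge links (unitary with determinant 1) and uses the common two-row scheme: only the first two rows are stored (12 real numbers instead of 18), and the third row is rebuilt as the complex conjugate of their cross product. In a memory-bound kernel this trades a few extra floating-point operations for less memory traffic. All function names and the test matrix are hypothetical and are not taken from the paper or from any lattice QCD library.

```python
import cmath
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def example_su3(theta=0.3, alpha=0.5, phi=0.7, psi=-0.2):
    """Build a hypothetical SU(3) test matrix: det-1 phases times two rotations."""
    c, s = math.cos(theta), math.sin(theta)
    r01 = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    c, s = math.cos(alpha), math.sin(alpha)
    r12 = [[1, 0, 0], [0, c, -s], [0, s, c]]
    # The three phases multiply to 1, so the product keeps determinant 1.
    d = [[cmath.exp(1j * phi), 0, 0],
         [0, cmath.exp(1j * psi), 0],
         [0, 0, cmath.exp(-1j * (phi + psi))]]
    return matmul(matmul(d, r01), r12)


def compress(u):
    """Keep only the first two rows (assumed two-row scheme)."""
    return [list(u[0]), list(u[1])]


def reconstruct(rows):
    """Rebuild the third row as conj(row0 x row1); valid only for SU(3)."""
    a, b = rows
    third = [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]
    return [list(a), list(b), third]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


if __name__ == "__main__":
    u = example_su3()
    err = max_abs_diff(u, reconstruct(compress(u)))
    print(f"stored reals: {2 * 3 * 2} of {3 * 3 * 2}, "
          f"max reconstruction error: {err:.1e}")
```

The reconstruction relies on the inverse of an SU(3) matrix being both its conjugate transpose and its adjugate, so it does not apply to a general unitary matrix whose determinant is a non-trivial phase.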