Computer Science, 2023, Vol. 50, Issue (6): 52-57. doi: 10.11896/jsjkx.230200159

• High Performance Computing •

Lattice QCD Calculation and Optimization on ARM Processors

SUN Wei 1, BI Yujiang 1,2, CHENG Yaodong 1,2,3

  1 Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2 Tianfu Cosmic Ray Research Center, Chengdu 610213, China
  3 School of Nuclear Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-02-22 Revised: 2023-04-21 Online: 2023-06-15 Published: 2023-06-06
  • About author: SUN Wei, born in 1992, Ph.D. His main research interests include high performance computing, lattice quantum chromodynamics and quantum computing.
  • Supported by:
    National Natural Science Foundation of China (12205311, 11935017, 12075253, 12175063), China Postdoctoral Science Foundation (2022M713154) and Science and Technology Innovation Project of Institute of High Energy Physics (E15451U2).

Abstract: Lattice quantum chromodynamics (lattice QCD) is one of the most important applications of large-scale parallel computing in high energy physics. Research in this field usually consumes a large amount of computing resources, and its core task is solving large-scale sparse linear systems. Using the domestic Kunpeng 920 ARM processor, this paper studies the hot spot of lattice QCD calculations, the Dslash operator, which is run on up to 64 nodes (6,144 cores) and shows linear scalability. Based on the roofline performance analysis model, we find that lattice QCD is a typical memory-bound application, and that compressing the 3×3 complex unitary matrices in Dslash by exploiting their symmetry improves the performance of Dslash by 22%. For solving the large-scale sparse linear systems, we also explore a commonly used Krylov subspace iterative algorithm, BiCGStab, and the recently developed state-of-the-art multigrid algorithm on the same ARM processor. We find that in practical physics calculations the multigrid algorithm is several times to an order of magnitude faster than BiCGStab, even when the multigrid setup time is included. Moreover, we consider the NEON vectorization instructions on the Kunpeng 920, which give up to a 20% improvement for the multigrid algorithm. Therefore, using the multigrid algorithm on ARM processors can greatly speed up physics research.
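
To illustrate the matrix compression mentioned in the abstract, the following minimal Python sketch shows one common symmetry-based scheme. It assumes the link matrices are SU(3), that is, unitary with unit determinant, so that only the first two rows (12 real numbers instead of 18) need to be stored and the third row can be rebuilt as the complex conjugate of their cross product. The abstract does not state which compression scheme the paper uses, so the two-row scheme and the function names `compress`, `decompress`, `third_row` and `example_su3` are assumptions made for illustration only.

```python
import cmath
import math


def compress(u):
    """Keep only the first two rows of a 3x3 matrix: 6 complex = 12 reals."""
    packed = []
    for row in u[:2]:
        for z in row:
            packed.extend((z.real, z.imag))
    return packed


def third_row(a, b):
    """Third row of an SU(3) matrix: conjugate of the cross product a x b.

    This relies on unitarity and det(U) = 1 (assumed, not checked here).
    """
    return [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]


def decompress(packed):
    """Rebuild the full 3x3 matrix from the 12 stored reals."""
    z = [complex(packed[2 * i], packed[2 * i + 1]) for i in range(6)]
    a, b = z[:3], z[3:]
    return [a, b, third_row(a, b)]


def example_su3(theta=0.3, alpha=0.7, beta=-1.1):
    """Hypothetical test matrix: an SU(2) rotation embedded in SU(3),
    with its columns cyclically permuted (the determinant stays 1)."""
    c, s = math.cos(theta), math.sin(theta)
    ea, eb = cmath.exp(1j * alpha), cmath.exp(1j * beta)
    return [
        [0j, c * ea, s * eb],
        [0j, -s * eb.conjugate(), c * ea.conjugate()],
        [1 + 0j, 0j, 0j],
    ]


if __name__ == "__main__":
    u = example_su3()
    packed = compress(u)
    v = decompress(packed)
    err = max(abs(u[i][j] - v[i][j]) for i in range(3) for j in range(3))
    print("stored reals per link:", len(packed))
    print("max reconstruction error:", err)
```

In this sketch each link needs one third less memory traffic (12 reals instead of 18) in exchange for a few extra floating-point operations per link, a trade-off that tends to pay off when a kernel is memory bound rather than compute bound.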

Key words: Lattice QCD, ARM architecture, Multigrid algorithm, Kunpeng 920, NEON vectorization

CLC Number: TP391