计算机科学 ›› 2022, Vol. 49 ›› Issue (6): 108-118.doi: 10.11896/jsjkx.210300259

• 高性能计算 • 上一篇    下一篇

基于GPU的并行DILU预处理技术

汪晋, 刘江   

  1. 1 中国科学院重庆绿色智能技术研究院 重庆 400714
    2 中国科学院大学计算机科学与技术学院 北京 100049
  • 收稿日期:2021-03-25 修回日期:2021-07-27 出版日期:2022-06-15 发布日期:2022-06-08
  • 通讯作者: 刘江(liujiang@cigit.ac.cn)
  • 作者简介:(wangjin@cigit.ac.cn)
  • 基金资助:
    国家自然科学基金(61672488);国家重点研究发展计划(2018YFC0116704)

GPU-based Parallel DILU Preconditioning Technique

WANG Jin, LIU Jiang   

  1. 1 Chongqing Institute of Green and Intelligent Technology,Chinese Academy of Sciences,Chongqing 400714,China
    2 School of Computer Science and Technology,University of Chinese Academy of Sciences,Beijing 100049,China
  • Received:2021-03-25 Revised:2021-07-27 Online:2022-06-15 Published:2022-06-08
  • About author:WANG Jin,born in 1995,postgraduate.His main research interests include pa-rallel computing and iterative methods for linear systems.
    LIU Jiang,born in 1979,Ph.D,associate professor,is a member of China Computer Federation.His main research interests include computability theory,formal methods and computer algorithms.
  • Supported by:
    National Natural Science Foundation of China(61672488) and National Key R & D Program of China(2018YFC0116704).

摘要: 在科学计算和工程领域,大型稀疏线性方程组的求解非常常见,目前已经有许多迭代方法和预处理技术被用于求解这类方程。DILU预处理技术类似于ILU,是开源计算流体力学软件OpenFOAM中重要的预处理技术,但未在OpenFOAM以外的领域引起关注,目前也没有完整的GPU实现。比较了DILU和ILU预处理技术对稳定双共轭梯度法(BiCGStab)加速的效果,以及它们在构造预处理子上的开销,结果表明,DILU在加速效果上不逊于ILU且在稳定性上优于ILU。在GPU并行实现方面,DILU可以使用分层并行和无全局同步并行两种并行策略,详细讨论了DILU预处理技术在这两种策略下的实现方法,给出了相关的算法和参考代码,然后比较了在两种并行策略下DILU预处理技术的性能。数值实验结果表明,在实践中两种并行策略各有优劣,可以根据实际表现进行选择。另外比较了GPU和CPU执行的DILU预处理技术,GPU在性能上具有明显优势,在线性方程组求解上存在性能瓶颈的程序可以移植到GPU平台以提升性能。

关键词: CUDA, DILU, GPU, OpenFOAM, 迭代法, 稀疏三角方程, 预处理

Abstract: Large sparse linear equations often appear in scientific computation and engineering.There are many iterative methods and preconditioning techniques for solving these linear equations.Diagonal-based incomplete LU (DILU) is a preconditioning technique similar to incomplete LU (ILU) factorization.DILU is applied in OpenFOAM,an open source computational fluid dynamics software,and is a very important preconditioning technique in OpenFOAM.DILU has not received extensive attention outside OpenFOAM,and there is no complete GPU-based implementation so far.This paper compares DILU preconditioned BiCGStab with ILU preconditioned BiCGStab,and the time elapses in preconditioner constructions.The numeric experiments suggest that DILU may be more efficient and stable than ILU.As for GPU-based parallel implementations,this paper discusses two parallel schemes,that are level-set scheme and synchronization-free scheme,and gives related algorithms and some codes under these two parallel schemes.It compares the performances of DILU preconditioning technique under two parallel schemes.The numeric results show that each scheme has its own advantages and disadvantages in different equations,and we can select one according to their performances in practice.This paper compares the performance of DILU preconditioning on GPU and CPU,and the results show that GPU is more competitive.The applications that have performance bottlenecks on linear systems solutions can be improved by moving to GPU platforms.

Key words: CUDA, DILU, GPU, Iterative methods, Open FOAM, Preconditioning, Sparse triangular equations

中图分类号: 

  • TP391.9
[1] SAAD Y.Iterative Methods for Sparse Linear Systems[M/OL].SIAM,2003.https://www-users.cse.umn.edu/~saad/itermethbook_2nded.pdf.
[2] GUTKNECHT M H.A brief introduction to Krylov spacemethods for solving linear systems[M]//Frontiers of Computational Science.Berlin:Springer,2007:53-62.
[3] LANGR D,TVRDIK P.Evaluation Criteria for Sparse Matrix Storage Formats[J].IEEE Transactions on Parallel and Distri-buted Systems,2016,27(2):428-440.
[4] DAVIS T A,RAJAMANICKAM S,SID-LAKHDAR W M.ASurvey of Direct Methods for Sparse Linear Systems[J].Acta Numerica,2016,25:383-566.
[5] HACKBUSCH W.Iterative Solution of Large Sparse Systems of Equations (2nd ed)[M].Berlin:Springer,2016.
[6] FILIPPONE S,CARDELLINI V,BARBIERI D,et al.SparseMatrix-vector Multiplication on GPGPUs[J].ACM Transactions on Mathematical Software (TOMS),2017,43(4):30.
[7] STEINBERGER M,DERLERY A,ZAYER R,et al.How Naive is Naive SpMV on the GPU?[C]//2016 IEEE High Perfor-mance Extreme Computing Conference(HPEC).IEEE,2016:1-8.
[8] GAO J,QI P,HE G.Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU[J].Mathematical Problems in Enginee-ring,2016,2016:1-14.
[9] BELL N,GARLAND M.Efficient sparse matrix-vector multiplication on CUDA:Nvidia Technical Report:NVR-2008-004[R].Nvidia Corporation,2008.
[10] BARRETT R,BERRY M W,CHAN T F,et al.Templates forthe Solution of Linear Systems:Building Blocks for Iterative Methods[M].SIAM Other Titles in Applied Mathematics,1994.
[11] BEHRENS T.OpenFOAM’s basic solvers for linear systems of equations[J/OL].Chalmers,Department of Applied Mechanics,2009,18(2).http://www.tfd.chalmers.se/~hani/kurser/os_cfd_2008/timberhrens/tibeh-report-fin.pdf.
[12] OpenFOAM Project[EB/OL].https://openfoam.com/.
[13] OWENS J D,HOUSTON M,LUEBKE D,et al.GPU computing[J].Proceedings of the IEEE,2008,96(5):879-899.
[14] MEIJERINK J A,VAN DER VORST H A.An Iterative Solution Method for Linear Systems of Which the Coefficient Matrix is a Symmetric M-Matrix[J].Mathematics of Computation,1977,31(137):148-162.
[15] KERSHAW D S.The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations[J].Journal of Computational Physics,1978,26(1):43-65.
[16] SAAD Y.ILUT:A dual threshold incomplete LU factorization[J].Numerical Linear Algebra with Applications,1994,1(4):387-402.
[17] POMMERELL C.Solution of large unsymmetric systems of li-near equations[D].ETH Zürich:Swiss Federal Institute of Technology Zurich,1992.
[18] ANDERSON E,SAAD Y.Solving Sparse Triangular LinearSystem on Parallel Computers[J].International Journal of High Speed Computing,1989,1(1):73-95.
[19] NAUMOV M.Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU:NVIDIA Technical Report:NVR-2011-001[R].2011.
[20] LI R,SAAD Y.GPU-accelerated preconditioned iterative linear solvers[J].The Journal of Supercomputing,2013,63(2):443-466.
[21] NAUMOV M,CASTONGUAY P,COHEN J.Parallel GraphColoring with Applications to the Incomplete-LU Factorization on the GPU:NVIDIA Technical Report:NVR-2015-001[R].2015.
[22] LIU W,LI A,HOGG J,et al.A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C]//European Conference on Parallel Processing.Cham:Springer,2016:617-630.
[23] LI R.On parallel solution of sparse triangular linear systems in CUDA[J].arXiv:1710.04985,2017.
[24] SU J,ZHANG F,LIU W,et al.CapelliniSpTRSV:A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs[C]//49th International Conference on Parallel Processing-ICPP.2020.
[25] XIE C,CHEN J,FIROZ J S,et al.Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures[J].arXiv:.06959,2020.
[26] ANZT H,RIBIZEL T,FLEGAR G,et al.ParILUT-A Parallel Threshold ILU for GPUs[C]//2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).2019:231-241.
[27] CHEN Y,TIAN X,LIU H,et al.Parallel ILU preconditioners in GPU computation[J].Soft Computing,2018,22(24):8187-8205.
[28] NAUMOV M.Incomplete-LU and Cholesky preconditioned ite-rative methods using CUSPARSE and CUBLAS[J/OL].Nvidia white paper,2011:1-16.https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.230.8934&rep=rep1&type=pdf.
[29] BNA S,SPISSOA I,OLESENB M,et al.PETSc4FOAM:A Library to plug-in PETSc into the OpenFOAM Framework[J/OL].PRACE White paper,2020.https://prace-ri.eu/wp-content/uploads/WP294-PETSc4FOAM-A-Library-to-plug-in-PETSc-into-the-OpenFOAM-Framework.pdf.
[30] Paralution project[EB/OL].http://www.paralution.com.
[31] SAAD Y,SCHULTZ M H.GMRES:A generalized minimal residual algorithm for solving nonsymmetric linear systems[J].SIAM Journal on Scientific and Statistical Computing,1986,7(3):856-869.
[32] VAN DER VORST H A.Bi-CGSTAB:A fast and smoothlyconverging variant of Bi-CG for the solution of nonsymmetric linear systems[J].SIAM Journal on Scientific and Statistical Computing,1992,13(2):631-644.
[33] JENSEN T R,TOFT B.Graph Coloring Problems[M/OL].1995.https://doi.org/10.3929/ethz-a-000669614.
[34] SALTZ J H.Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors[J].SIAM Journal on Scientific and Statistical Computing.1990,11(1):123-144.
[35] KNUTH D E.The art of computer programming,volume 3:(2nd ed.) sorting and searching[M].Addison Wesley Longman Publishing Co.,Inc.,1998,17(4):324.
[36] CORMEN T H,LEISERSON C E,RIVEST R L,et al.Introduction to algorithms second edition[M].MIT Press,2001.
[37] HORN R A,JOHNSON C R.The Hadamard product[M]//Topics in Matrix Analysis:Cambridge University Press,1991:298-381.
[1] 侯钰涛, 阿布都克力木·阿布力孜, 哈里旦木·阿布都克里木.
中文预训练模型研究进展
Advances in Chinese Pre-training Models
计算机科学, 2022, 49(7): 148-163. https://doi.org/10.11896/jsjkx.211200018
[2] 宗迪迪, 谢益武.
基于法线迭代的模型中轴生成方法
Model Medial Axis Generation Method Based on Normal Iteration
计算机科学, 2022, 49(6A): 764-770. https://doi.org/10.11896/jsjkx.210400050
[3] 周琴, 罗飞, 丁炜超, 顾春华, 郑帅.
基于逐次超松弛技术的Double Speedy Q-Learning算法
Double Speedy Q-Learning Based on Successive Over Relaxation
计算机科学, 2022, 49(3): 239-245. https://doi.org/10.11896/jsjkx.201200173
[4] 刘江, 刘文博, 张矩.
OpenFoam中多面体网格生成的MPI+OpenMP混合并行方法
Hybrid MPI+OpenMP Parallel Method on Polyhedral Grid Generation in OpenFoam
计算机科学, 2022, 49(3): 3-10. https://doi.org/10.11896/jsjkx.210700060
[5] 黄颖琦, 陈红梅.
基于代价敏感卷积神经网络的非平衡问题混合方法
Cost-sensitive Convolutional Neural Network Based Hybrid Method for Imbalanced Data Classification
计算机科学, 2021, 48(9): 77-85. https://doi.org/10.11896/jsjkx.200900013
[6] 李繁, 严星, 张晓宇.
基于GPU的特征脸算法优化研究
Optimization of GPU-based Eigenface Algorithm
计算机科学, 2021, 48(4): 197-204. https://doi.org/10.11896/jsjkx.200600033
[7] 胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立.
基于GPU加速的并行WMD算法
Parallel WMD Algorithm Based on GPU Acceleration
计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213
[8] 文敏华, 汪申鹏, 韦建文, 李林颖, 张斌, 林新华.
基于DGX-2的湍流燃烧问题优化研究
DGX-2 Based Optimization of Application for Turbulent Combustion
计算机科学, 2021, 48(12): 43-48. https://doi.org/10.11896/jsjkx.201200129
[9] 汪亮, 周新志, 严华.
基于GPU的实时SIFT算法
Real-time SIFT Algorithm Based on GPU
计算机科学, 2020, 47(8): 105-111. https://doi.org/10.11896/jsjkx.190700036
[10] 倪晓军, 佘戌豪.
面向无线传感网络应用的改进LZW算法
Improvement of LZW Algorithms for Wireless Sensor Networks
计算机科学, 2020, 47(5): 260-264. https://doi.org/10.11896/jsjkx.190400108
[11] 刘世芳, 赵永华, 于天禹, 黄荣锋.
广义稠密对称特征问题标准化算法在GPU集群上的有效实现
Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster
计算机科学, 2020, 47(4): 6-12. https://doi.org/10.11896/jsjkx.191000009
[12] 左宪禹, 张哲, 苏岳瀚, 刘扬, 葛强, 田军锋.
基于GPU多流并发并行模型的NDVI提取算法
Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model
计算机科学, 2020, 47(4): 25-29. https://doi.org/10.11896/jsjkx.190500029
[13] 曹锋,徐扬,钟建,宁欣然.
基于目标演绎距离的一阶逻辑子句集预处理方法
First-order Logic Clause Set Preprocessing Method Based on Goal Deduction Distance
计算机科学, 2020, 47(3): 217-221. https://doi.org/10.11896/jsjkx.190100004
[14] 陈佳,欧阳金源,冯安琪,吴远,钱丽萍.
边缘计算构架下基于孤立森林算法的DoS异常检测
DoS Anomaly Detection Based on Isolation Forest Algorithm Under Edge Computing Framework
计算机科学, 2020, 47(2): 287-293. https://doi.org/10.11896/jsjkx.190100047
[15] 孙士远, 何勇.
自动机终结字查找算法的设计与实现
Designs and Implementations of Algorithms for Searching Terminal Words of Automata
计算机科学, 2020, 47(11A): 599-603. https://doi.org/10.11896/jsjkx.200300096
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!