非结构网格下稀疏下三角方程求解器众核优化技术研究

摘要/Abstract

摘要： 稀疏下三角方程求解器(SpTRSV)作为基础线性代数库中一个重要的算法,在大规模科学计算中有着广泛应用。在非结构网格中,由于非结构网格具有数据存储无序性、数据强相关性以及频繁地离散访存等特点,该算法在众核架构上难以实现有效的并行。文中基于国产异构众核处理器SW26010体系结构的特点,针对非结构网格计算,提出了一种基于流水线串行-局部并行思想的通用众核优化方法。该方法能够有效减少非结构网格计算中的随机访存,提高计算效率,并且具有很好的扩展性。基于该算法对多个实际应用算例进行众核优化,实验结果表明:该方法能够实现单核组3倍以上的加速,显著降低了运行时间。

关键词: SW26010, 并行算法, 非结构网格, 稀疏下三角方程求解器, 异构众核优化

Abstract: Sparse Triangular Solver (SpTRSV),as an important algorithm in basic linear algebraic library,has been widely used in large-scale scientific computing.In unstructured-grids,because unstructured grid have the characte-ristics of data storage disorder,data depth correlation and frequent discrete-time memory access,this algorithm is difficult to achieve effective parallelism in the many-core architecture.In this paper,based on the architecture of the domestic heterogeneous multiprocessor SW26010 architecture,a general kernel optimization method based on pipelined serial and local parallel was proposed for unstructured grid computing.This method can effectively reduce random access in unstructured grid computing,improve the computing efficiency,and have the good scalability.Based on this algorithm,multiple kernel optimization is carried out for several practical applications.The experimental results show that the method can achieve more than 3 times acceleration of the single core group and significantly reduce the running time.

Key words: Heterogeneous many-core Optimization, Parallel algorithm, SpTRSV, SW26010, Unstructured-grids

中图分类号:

TP311

倪鸿, 刘鑫. 非结构网格下稀疏下三角方程求解器众核优化技术研究[J]. 计算机科学, 2019, 46(6A): 518-522. https://doi.org/

NI Hong, LIU Xin. Many-core Optimization for Sparse Triangular Solver Under Unstructured Grids[J]. Computer Science, 2019, 46(6A): 518-522. https://doi.org/

参考文献

[1]ANDERSON E,SAAD Y.Solving sparse triangular linear systems on parallel computers[J].International Journal of High Speed Computing,1989,1(1):73-95.
[2]DUFF I S,ERISMAN A M,REID J K.Direct Methods for Sparse Matrices[J].Matehematics of Computation,1986,52(185):250.
[3]SAAD Y.Iterative methods for sparse linear systems[OL].http://ieeexplore.ieee.org/ielx5/99/12132/001231631.pdf.
[4]REGULY I Z,MUDALIGE G R,BERTOLLI C,et al.Acceleration of a full-scale industrial cfd application with op2.IEEE Transactionson Parallel and Distributed Systems,2016,27(5):1265-1278.
[5]YAO C,YANG X,WANG W,et al.26 pflops stencil computations for atmospheric modeling on sunway taihulight[C]∥2017 IEEE International Parallel and Distributed Processing Sympo-sium(IPDPS).IEEE,2017:535-544.
[6]CHEN T,LI M,LI Y,et al.Mxnet:A flexible and efficient machine learning library for heterogeneous distributed systems[J].arXiv preprint arXiv:1512.01274,2015.
[7]ANDERSON E,SAAD Y.Solving sparse triangular linear systems on parallel computers[J].International Journal of High Speed Computing,1989,1(1):73-95.
[8]KALIKAR S,NASRE R.Domlock:A new multi-granularity locking technique for hierarchies[C]∥Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,2016.
[9]SCHREIBER R,TANG W P.Vectorizing the conjugate gradient method[D].Los Angeles:Stanford University,1982.
[10]LIU W,LI A,HOGG J D,et al.Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides[J].Concurrency and Computation:Practice and Experience,2017,29(21):e4244-n/a.
[11]PICCIAU A,INGGS G E,WICKERSON J,et al.Balancing locality and concurrency:Solving sparse triangular systems on gpus[C]∥2016 IEEE 23rd International Conference on High Performance Computing(HiPC).2016:183-192.
[12]敖玉龙.国产大型众核系统上稀疏矩阵和Stencil运算的性能优化关键技术研究[D].北京:中科院软件研究所,2017:27-47.
[13]王欣亮.关键稀疏数值计算核心在国产众核架构上的性能优化研究[D].北京:清华大学,2018:24-43.
[14]QI F B.Sunway TaihuLight super computer[J].Communications of the CCF,2017,13(10):16-22.
[15]ANDERSON W K,GROPP W D,KAUSHIK D K,et al.Achieving High Sustained Performance in an Unstructured Mesh CFD Application[C]∥Conference on High Performance Computing (Super Computing).1999.
[16]林恒.基于超大规模异构体系结构的图计算系统研究[D].北京:清华大学,2018:23-53.

相关文章 15

[1]	叶跃进, 李芳, 陈德训, 郭恒, 陈鑫. 基于国产众核架构的非结构网格分区块重构预处理算法研究 Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture 计算机科学, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045
[2]	陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[3]	朱雨, 庞建民, 徐金龙, 陶小涵, 王军. 面向SW26010处理器的三维Stencil自适应分块参数算法 Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor 计算机科学, 2021, 48(6): 10-18. https://doi.org/10.11896/jsjkx.200700059
[4]	何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵. 基于神威平台的Floyd并行算法的实现和优化 Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform 计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051
[5]	李雨蓉, 刘杰, 刘亚林, 龚春叶, 王勇. 面向语音分离的深层转导式非负矩阵分解并行算法 Parallel Algorithm of Deep Transductive Non-negative Matrix Factorization for Speech Separation 计算机科学, 2020, 47(8): 49-55. https://doi.org/10.11896/jsjkx.190900202
[6]	袁欣辉, 林蓉芬, 魏迪, 尹万旺, 徐金秀. 面向国产异构众核处理器SW26010的BFS优化方法 Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010 计算机科学, 2020, 47(8): 98-104. https://doi.org/10.11896/jsjkx.191000013
[7]	吕小敬, 刘钊, 褚学森, 石树鹏, 孟虹松, 黄震春. 面向超大规模并行模拟的LBM计算流体力学软件 Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations 计算机科学, 2020, 47(4): 13-17. https://doi.org/10.11896/jsjkx.191000010
[8]	李芳,李志辉,徐金秀,范昊,褚学森,李新亮. 基于十亿亿次国产超算系统的流体力学软件众核适应性研究 Research on Adaptation of CFD Software Based on Many-core Architecture of 100P Domestic Supercomputing System 计算机科学, 2020, 47(1): 24-30. https://doi.org/10.11896/jsjkx.181102176
[9]	陶小涵, 庞建民, 高伟, 王琦, 姚金阳. 基于SW26010处理器的FT程序的性能优化 Performance Optimization of FT Program Based on SW26010 Processor 计算机科学, 2019, 46(4): 321-328. https://doi.org/10.11896/j.issn.1002-137X.2019.04.050
[10]	朱超, 吴素萍. 并行Harris特征点检测算法 Parallel Harris Feature Point Detection Algorithm 计算机科学, 2019, 46(11A): 289-293.
[11]	周杰,李文敬. 基于三层混合编程模型的Petri网并行算法研究 Research on Parallel Algorithm of Petri Net Based on Three-layer Mixed Programming Model 计算机科学, 2017, 44(Z11): 586-591. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.126
[12]	唐兵,Laurent BOBELIN,贺海武. 基于MPI和OpenMP混合编程的非负矩阵分解并行算法 Parallel Algorithm of Nonnegative Matrix Factorization Based on Hybrid MPI and OpenMP Programming Model 计算机科学, 2017, 44(3): 51-54. https://doi.org/10.11896/j.issn.1002-137X.2017.03.013
[13]	石嵩,宁永波,李宏亮,郑方. 阵列众核结构上的一种多层分区Hash连接算法 Multi-layer Partition Hash Join Algorithm on Array-based Manycore Architecture 计算机科学, 2016, 43(3): 18-22. https://doi.org/10.11896/j.issn.1002-137X.2016.03.003
[14]	唐家维,王晓峰. 基于GPU的并行化Apriori算法的设计与实现 Design and Implementation of Apriori on GPU 计算机科学, 2014, 41(10): 238-243. https://doi.org/10.11896/j.issn.1002-137X.2014.10.050
[15]	毕硕本,陈东祺,颜坚,郭忆. 基于二维凸壳的平面点集Delaunay三角网算法 Planar Delaunay Triangulation Algorithm Based on 2D Convex Hull 计算机科学, 2014, 41(10): 317-320. https://doi.org/10.11896/j.issn.1002-137X.2014.10.066

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed