计算机科学 ›› 2019, Vol. 46 ›› Issue (6A): 518-522.
倪鸿, 刘鑫
NI Hong, LIU Xin
摘要: 稀疏下三角方程求解器(SpTRSV)作为基础线性代数库中一个重要的算法,在大规模科学计算中有着广泛应用。在非结构网格中,由于非结构网格具有数据存储无序性、数据强相关性以及频繁地离散访存等特点,该算法在众核架构上难以实现有效的并行。文中基于国产异构众核处理器SW26010体系结构的特点,针对非结构网格计算,提出了一种基于流水线串行-局部并行思想的通用众核优化方法。该方法能够有效减少非结构网格计算中的随机访存,提高计算效率,并且具有很好的扩展性。基于该算法对多个实际应用算例进行众核优化,实验结果表明:该方法能够实现单核组3倍以上的加速,显著降低了运行时间。
中图分类号:
[1]ANDERSON E,SAAD Y.Solving sparse triangular linear systems on parallel computers[J].International Journal of High Speed Computing,1989,1(1):73-95. [2]DUFF I S,ERISMAN A M,REID J K.Direct Methods for Sparse Matrices[J].Matehematics of Computation,1986,52(185):250. [3]SAAD Y.Iterative methods for sparse linear systems[OL].http://ieeexplore.ieee.org/ielx5/99/12132/001231631.pdf. [4]REGULY I Z,MUDALIGE G R,BERTOLLI C,et al.Acceleration of a full-scale industrial cfd application with op2.IEEE Transactionson Parallel and Distributed Systems,2016,27(5):1265-1278. [5]YAO C,YANG X,WANG W,et al.26 pflops stencil computations for atmospheric modeling on sunway taihulight[C]∥2017 IEEE International Parallel and Distributed Processing Sympo-sium(IPDPS).IEEE,2017:535-544. [6]CHEN T,LI M,LI Y,et al.Mxnet:A flexible and efficient machine learning library for heterogeneous distributed systems[J].arXiv preprint arXiv:1512.01274,2015. [7]ANDERSON E,SAAD Y.Solving sparse triangular linear systems on parallel computers[J].International Journal of High Speed Computing,1989,1(1):73-95. [8]KALIKAR S,NASRE R.Domlock:A new multi-granularity locking technique for hierarchies[C]∥Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,2016. [9]SCHREIBER R,TANG W P.Vectorizing the conjugate gradient method[D].Los Angeles:Stanford University,1982. [10]LIU W,LI A,HOGG J D,et al.Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides[J].Concurrency and Computation:Practice and Experience,2017,29(21):e4244-n/a. [11]PICCIAU A,INGGS G E,WICKERSON J,et al.Balancing locality and concurrency:Solving sparse triangular systems on gpus[C]∥2016 IEEE 23rd International Conference on High Performance Computing(HiPC).2016:183-192. [12]敖玉龙.国产大型众核系统上稀疏矩阵和Stencil运算的性能优化关键技术研究[D].北京:中科院软件研究所,2017:27-47. [13]王欣亮.关键稀疏数值计算核心在国产众核架构上的性能优化研究[D].北京:清华大学,2018:24-43. [14]QI F B.Sunway TaihuLight super computer[J].Communications of the CCF,2017,13(10):16-22. [15]ANDERSON W K,GROPP W D,KAUSHIK D K,et al.Achieving High Sustained Performance in an Unstructured Mesh CFD Application[C]∥Conference on High Performance Computing (Super Computing).1999. [16]林恒.基于超大规模异构体系结构的图计算系统研究[D].北京:清华大学,2018:23-53. |
[1] | 叶跃进, 李芳, 陈德训, 郭恒, 陈鑫. 基于国产众核架构的非结构网格分区块重构预处理算法研究 Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture 计算机科学, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045 |
[2] | 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157 |
[3] | 朱雨, 庞建民, 徐金龙, 陶小涵, 王军. 面向SW26010处理器的三维Stencil自适应分块参数算法 Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor 计算机科学, 2021, 48(6): 10-18. https://doi.org/10.11896/jsjkx.200700059 |
[4] | 何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵. 基于神威平台的Floyd并行算法的实现和优化 Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform 计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051 |
[5] | 李雨蓉, 刘杰, 刘亚林, 龚春叶, 王勇. 面向语音分离的深层转导式非负矩阵分解并行算法 Parallel Algorithm of Deep Transductive Non-negative Matrix Factorization for Speech Separation 计算机科学, 2020, 47(8): 49-55. https://doi.org/10.11896/jsjkx.190900202 |
[6] | 袁欣辉, 林蓉芬, 魏迪, 尹万旺, 徐金秀. 面向国产异构众核处理器SW26010的BFS优化方法 Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010 计算机科学, 2020, 47(8): 98-104. https://doi.org/10.11896/jsjkx.191000013 |
[7] | 吕小敬, 刘钊, 褚学森, 石树鹏, 孟虹松, 黄震春. 面向超大规模并行模拟的LBM计算流体力学软件 Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations 计算机科学, 2020, 47(4): 13-17. https://doi.org/10.11896/jsjkx.191000010 |
[8] | 李芳,李志辉,徐金秀,范昊,褚学森,李新亮. 基于十亿亿次国产超算系统的流体力学软件众核适应性研究 Research on Adaptation of CFD Software Based on Many-core Architecture of 100P Domestic Supercomputing System 计算机科学, 2020, 47(1): 24-30. https://doi.org/10.11896/jsjkx.181102176 |
[9] | 陶小涵, 庞建民, 高伟, 王琦, 姚金阳. 基于SW26010处理器的FT程序的性能优化 Performance Optimization of FT Program Based on SW26010 Processor 计算机科学, 2019, 46(4): 321-328. https://doi.org/10.11896/j.issn.1002-137X.2019.04.050 |
[10] | 朱超, 吴素萍. 并行Harris特征点检测算法 Parallel Harris Feature Point Detection Algorithm 计算机科学, 2019, 46(11A): 289-293. |
[11] | 周杰,李文敬. 基于三层混合编程模型的Petri网并行算法研究 Research on Parallel Algorithm of Petri Net Based on Three-layer Mixed Programming Model 计算机科学, 2017, 44(Z11): 586-591. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.126 |
[12] | 唐兵,Laurent BOBELIN,贺海武. 基于MPI和OpenMP混合编程的非负矩阵分解并行算法 Parallel Algorithm of Nonnegative Matrix Factorization Based on Hybrid MPI and OpenMP Programming Model 计算机科学, 2017, 44(3): 51-54. https://doi.org/10.11896/j.issn.1002-137X.2017.03.013 |
[13] | 石嵩,宁永波,李宏亮,郑方. 阵列众核结构上的一种多层分区Hash连接算法 Multi-layer Partition Hash Join Algorithm on Array-based Manycore Architecture 计算机科学, 2016, 43(3): 18-22. https://doi.org/10.11896/j.issn.1002-137X.2016.03.003 |
[14] | 唐家维,王晓峰. 基于GPU的并行化Apriori算法的设计与实现 Design and Implementation of Apriori on GPU 计算机科学, 2014, 41(10): 238-243. https://doi.org/10.11896/j.issn.1002-137X.2014.10.050 |
[15] | 毕硕本,陈东祺,颜坚,郭忆. 基于二维凸壳的平面点集Delaunay三角网算法 Planar Delaunay Triangulation Algorithm Based on 2D Convex Hull 计算机科学, 2014, 41(10): 317-320. https://doi.org/10.11896/j.issn.1002-137X.2014.10.066 |
|