计算机科学 ›› 2021, Vol. 48 ›› Issue (6): 10-18.doi: 10.11896/jsjkx.200700059

• 计算机体系结构* • 上一篇    下一篇

面向SW26010处理器的三维Stencil自适应分块参数算法

朱雨, 庞建民, 徐金龙, 陶小涵, 王军   

  1. 数学工程与先进计算国家重点实验室(信息工程大学) 郑州450000
  • 收稿日期:2020-07-09 修回日期:2020-08-30 出版日期:2021-06-15 发布日期:2021-06-03
  • 通讯作者: 徐金龙(longkaizh@126.com)
  • 基金资助:
    之江实验室重大科研基金资助项目(2018FD0ZX01)

Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor

ZHU Yu, PANG Jian-min, XU Jin-long, TAO Xiao-han, WANG Jun   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing,PLA Information Engineering University,Zhengzhou 450000,China
  • Received:2020-07-09 Revised:2020-08-30 Online:2021-06-15 Published:2021-06-03
  • About author:ZHU Yu,born in 1998,postgraduate.Her main research interests include high-performance computing and so on.(i_zhuyu@126.com)
    XU Jin-long,born in 1985,Ph.D,lectu-rer.His main research interests include high-performance computing and computer architecture.
  • Supported by:
    Major Scientific Project of Zhejiang Lab Advanced Industrial Network Security Platform(2018FD0ZX01).

摘要: Stencil计算是科学应用中的一类重要计算,而分块是提升Stencil计算数据局部性的关键技术。针对现有三维Stencil优化在SW26010处理器上缺少时间分块以及分块参数需手工调优的问题,引入时间分块,提出了面向SW26010处理器的三维Stencil自适应分块参数算法。通过建立性能分析模型,结合硬件计算能力及存储容量等限制因素,文中系统地分析了分块参数对模型性能的影响,判断性能瓶颈,指导分块参数的优化方向。基于性能分析模型,自适应分块参数算法可给出预测性能最优时的分块参数,有利于三维Stencil在SW26010处理器上的快速优化部署。选取了三维7点和三维27点Stencil算例进行实验。与未使用时间分块的三维Stencil优化相比,以上两个算例在自适应选择的分块参数下可以达到1.47和1.29的加速比,且实际最优分块参数与理论最佳分块参数一致,这验证了所提性能分析模型及自适应分块参数算法的有效性。

关键词: SW26010, 分块大小, 三维Stencil计算, 性能分析模型

Abstract: Stencil computation is an important part of scientific computing and large-scale applications.Tiling is a widely-used technique to explore the data locality of Stencil computation.In the existing methods of 3D Stencil optimization on SW26010,time tiling is rarely used and manual tuning is needed for tiling size.To solve this problem,this paper introduces time tiling method and proposes an adaptive tiling size algorithm for 3D Stencil computation on SW26010 many-core processor.By establishing a performance analysis model,we systematically analyze the influence of tiling size to the performance of 3D Stencil computation,identify the performance bottleneck and guide the optimization direction under the hardware resource constraints.Based on the performance analysis model,the adaptive tiling size algorithm provides the predicted optimal tiling size,which can be helpful to deploy 3D Stencil rapidly on SW26010 processor.3D-7P Stencil and 3D-27P Stencil are selected for experiment.Compared with the result lacking of time tiling,the speedup rates of the above two examples with optimal tiling size given by our algorithm can reach 1.47 and 1.29,and the optimal tiling size in experiment is consistent with that given by our model,which verify the proposed performance analysis model and tiling size adaptive algorithm.

Key words: 3D Stencil computing, Performance analysis model, SW26010, Tiling size

中图分类号: 

  • TP391
[1]RIVERA G,TSENG C W.Tiling optimizations for 3D scientific computations [C]//Proceedings of the 2000 ACM/IEEE Confe-rence on Supercomputing(SC’00).Piscataway:IEEE,2000:32.
[2]YUAN L,ZHANG Y,GUO P,et al.Tessellating stencils [C]//Proceedings of the International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.New York:ACM,2017:1-13.
[3]WIJESINGHE T,SENEVIRATHNE K,SIRIWARDHANA C,et al.Parameterized Diamond Tiling for Parallelizing stencil computations [C]//2017 Moratuwa Engineering Research Conference (MERCon).Piscataway:IEEE,2017:99-104.
[4]GUO P,YUAN L,ZHANG Y,et al. Parallel Stencil Algorithm Based on Tessellating [J].Journal of Frontiers of Computer Science and Technology,2019,13(2):181-194.
[5]MURANUSHI T,MAKINO J.Optimal Temporal Blocking for Stencil Computation [J].Procedia Computer Science,2015,51:1303-1312.
[6]WELLEIN G,HAGER G,ZEISER T,et al.Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization [C]//2009 33rd Annual IEEE International Computer Software and Applications Conference.Piscataway:IEEE,2009:579-586.
[7]FRIGO M,STRUMPEN V.Cache oblivious stencil computa-tions [C]//Proceedings of the 19th Annual International Confe-rence on Supercomputing.New York:ACM,2005:361-366.
[8]KRISHNAMOORTHY S,BASKARAN M,BONDHUGULAU,et al.Effective automatic parallelization of stencil computations [C]//Programming Language Design and Implementation.New York:ACM,2007:235-244.
[9]NGUYEN A,SATISH N,CHHUGANI J,et al.3.5-D blocking optimization for stencil computations on modern CPUs and GPUs [C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Sto-rage and Analysis(SC’10).Piscataway:IEEE,2010:1-13.
[10]RAWAT P S,HONG C,RAVISHANKAR M,et al.Effectiveresource management for enhancing performance of 2D and 3D stencils on GPUs [C]//Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit(GPGPU ’16).New York:ACM,2016:92-102.
[11]SZUSTAK L,ROJEK K,OLAS T,et al. Adaptation of MPDATA Heterogeneous Stencil Computation to Intel Xeon Phi Coprocessor [J].Scientific Programming,2015(5):1-14.
[12]WILLIAMS S,SHALF J,OLIKER L,et al. Scientific computing kernels on the cell processor [J].International Journal of Parallel Programming,2007,35(3):263-298.
[13]AO Y,YANG C,WANG X,et al.26 pflops stencil computations for atmospheric modeling on sunway taihulight [C]//2017 IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS).Piscataway:IEEE,2017:535-544.
[14]MENG J,SKADRON K.Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs [C]//International Conference on Supercomputing.New York:ACM,2009:256-265.
[15]RAWAT P S,VAIDYA M,SUKUMARAN-RAJAM A,et al. Domain-Specific Optimization and Generation of High-Perfor-mance GPU Code for Stencil Computations [J].Proceedings of the IEEE,2018,106(11):1902-1920 .
[16]DATTA K,MURPHY M,VOLKOV V,et al.Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures [C]//Proceedings of the 2008 ACM/IEEE Con-ference on Supercomputing(SC’08).Piscataway:IEEE,2008:1-12.
[17]MATSUMURA K,ZOHOURI H R,WAHIB M,et al.AN5D:automated stencil framework for high-degree temporal blocking on GPUs [C]//Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization.New York:ACM,2020 .
[18]UNAT D,CAI X,BADEN S B.Mint:realizing CUDA perfor-mance in 3D stencil methods with annotated C [C]//International Conference on Supercomputing.New York:ACM,2011:214-224.
[19]BONDHUGULA U,BANDISHTI V,PANANILATH I.Dia-mond Tiling:Tiling Techniques to Maximize Parallelism for Stencil Computations [J].IEEE Transactions on Parallel and Distributed Systems,2017,28(5):1285-1298.
[20]WU Q,NI Y,HUANG X.Regional Ocean Model Parallel Optimization in “Sunway TaihuLight” [J].Journal of Computer Research and Development,2019,56(7):1556-1566.
[21]LV X J,LIU Z,CHU X S,et al. Extreme-scale SimulationBased LBM Computing Fluid Dynamics Simulations [J].Computer Science,2020,47(4):13-17.
[22]SAIDI S,TENDULKAR P,LEPLEY T,et al. Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms [J].Microprocessors Microsystems,2013,37(8):848-857.
[23]XU Z,LIN J,MATSUOKA S.Benchmarking sw26010 many-core processor [C]//2017 IEEE International Parallel and Distributed Processing Symposium Workshops(IPDPSW).Piscata-way:IEEE,2017:743-752.
[1] 何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵.
基于神威平台的Floyd并行算法的实现和优化
Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform
计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051
[2] 袁欣辉, 林蓉芬, 魏迪, 尹万旺, 徐金秀.
面向国产异构众核处理器SW26010的BFS优化方法
Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010
计算机科学, 2020, 47(8): 98-104. https://doi.org/10.11896/jsjkx.191000013
[3] 吕小敬, 刘钊, 褚学森, 石树鹏, 孟虹松, 黄震春.
面向超大规模并行模拟的LBM计算流体力学软件
Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations
计算机科学, 2020, 47(4): 13-17. https://doi.org/10.11896/jsjkx.191000010
[4] 倪鸿, 刘鑫.
非结构网格下稀疏下三角方程求解器众核优化技术研究
Many-core Optimization for Sparse Triangular Solver Under Unstructured Grids
计算机科学, 2019, 46(6A): 518-522.
[5] 陶小涵, 庞建民, 高伟, 王琦, 姚金阳.
基于SW26010处理器的FT程序的性能优化
Performance Optimization of FT Program Based on SW26010 Processor
计算机科学, 2019, 46(4): 321-328. https://doi.org/10.11896/j.issn.1002-137X.2019.04.050
[6] 晋钢,王蕾,王志英.
异步电路的静态数据流图模型及其性能分析
Performance Evaluation Method for Asynchronous Circuit Based on Static Data Flow Structure
计算机科学, 2009, 36(12): 231-234.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!