Computer Science ›› 2021, Vol. 48 ›› Issue (6): 10-18.doi: 10.11896/jsjkx.200700059

• Computer Architecture • Previous Articles     Next Articles

Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor

ZHU Yu, PANG Jian-min, XU Jin-long, TAO Xiao-han, WANG Jun   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing,PLA Information Engineering University,Zhengzhou 450000,China
  • Received:2020-07-09 Revised:2020-08-30 Online:2021-06-15 Published:2021-06-03
  • About author:ZHU Yu,born in 1998,postgraduate.Her main research interests include high-performance computing and so on.(i_zhuyu@126.com)
    XU Jin-long,born in 1985,Ph.D,lectu-rer.His main research interests include high-performance computing and computer architecture.
  • Supported by:
    Major Scientific Project of Zhejiang Lab Advanced Industrial Network Security Platform(2018FD0ZX01).

Abstract: Stencil computation is an important part of scientific computing and large-scale applications.Tiling is a widely-used technique to explore the data locality of Stencil computation.In the existing methods of 3D Stencil optimization on SW26010,time tiling is rarely used and manual tuning is needed for tiling size.To solve this problem,this paper introduces time tiling method and proposes an adaptive tiling size algorithm for 3D Stencil computation on SW26010 many-core processor.By establishing a performance analysis model,we systematically analyze the influence of tiling size to the performance of 3D Stencil computation,identify the performance bottleneck and guide the optimization direction under the hardware resource constraints.Based on the performance analysis model,the adaptive tiling size algorithm provides the predicted optimal tiling size,which can be helpful to deploy 3D Stencil rapidly on SW26010 processor.3D-7P Stencil and 3D-27P Stencil are selected for experiment.Compared with the result lacking of time tiling,the speedup rates of the above two examples with optimal tiling size given by our algorithm can reach 1.47 and 1.29,and the optimal tiling size in experiment is consistent with that given by our model,which verify the proposed performance analysis model and tiling size adaptive algorithm.

Key words: 3D Stencil computing, Performance analysis model, SW26010, Tiling size

CLC Number: 

  • TP391
[1]RIVERA G,TSENG C W.Tiling optimizations for 3D scientific computations [C]//Proceedings of the 2000 ACM/IEEE Confe-rence on Supercomputing(SC’00).Piscataway:IEEE,2000:32.
[2]YUAN L,ZHANG Y,GUO P,et al.Tessellating stencils [C]//Proceedings of the International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.New York:ACM,2017:1-13.
[3]WIJESINGHE T,SENEVIRATHNE K,SIRIWARDHANA C,et al.Parameterized Diamond Tiling for Parallelizing stencil computations [C]//2017 Moratuwa Engineering Research Conference (MERCon).Piscataway:IEEE,2017:99-104.
[4]GUO P,YUAN L,ZHANG Y,et al. Parallel Stencil Algorithm Based on Tessellating [J].Journal of Frontiers of Computer Science and Technology,2019,13(2):181-194.
[5]MURANUSHI T,MAKINO J.Optimal Temporal Blocking for Stencil Computation [J].Procedia Computer Science,2015,51:1303-1312.
[6]WELLEIN G,HAGER G,ZEISER T,et al.Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization [C]//2009 33rd Annual IEEE International Computer Software and Applications Conference.Piscataway:IEEE,2009:579-586.
[7]FRIGO M,STRUMPEN V.Cache oblivious stencil computa-tions [C]//Proceedings of the 19th Annual International Confe-rence on Supercomputing.New York:ACM,2005:361-366.
[8]KRISHNAMOORTHY S,BASKARAN M,BONDHUGULAU,et al.Effective automatic parallelization of stencil computations [C]//Programming Language Design and Implementation.New York:ACM,2007:235-244.
[9]NGUYEN A,SATISH N,CHHUGANI J,et al.3.5-D blocking optimization for stencil computations on modern CPUs and GPUs [C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Sto-rage and Analysis(SC’10).Piscataway:IEEE,2010:1-13.
[10]RAWAT P S,HONG C,RAVISHANKAR M,et al.Effectiveresource management for enhancing performance of 2D and 3D stencils on GPUs [C]//Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit(GPGPU ’16).New York:ACM,2016:92-102.
[11]SZUSTAK L,ROJEK K,OLAS T,et al. Adaptation of MPDATA Heterogeneous Stencil Computation to Intel Xeon Phi Coprocessor [J].Scientific Programming,2015(5):1-14.
[12]WILLIAMS S,SHALF J,OLIKER L,et al. Scientific computing kernels on the cell processor [J].International Journal of Parallel Programming,2007,35(3):263-298.
[13]AO Y,YANG C,WANG X,et al.26 pflops stencil computations for atmospheric modeling on sunway taihulight [C]//2017 IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS).Piscataway:IEEE,2017:535-544.
[14]MENG J,SKADRON K.Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs [C]//International Conference on Supercomputing.New York:ACM,2009:256-265.
[15]RAWAT P S,VAIDYA M,SUKUMARAN-RAJAM A,et al. Domain-Specific Optimization and Generation of High-Perfor-mance GPU Code for Stencil Computations [J].Proceedings of the IEEE,2018,106(11):1902-1920 .
[16]DATTA K,MURPHY M,VOLKOV V,et al.Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures [C]//Proceedings of the 2008 ACM/IEEE Con-ference on Supercomputing(SC’08).Piscataway:IEEE,2008:1-12.
[17]MATSUMURA K,ZOHOURI H R,WAHIB M,et al.AN5D:automated stencil framework for high-degree temporal blocking on GPUs [C]//Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization.New York:ACM,2020 .
[18]UNAT D,CAI X,BADEN S B.Mint:realizing CUDA perfor-mance in 3D stencil methods with annotated C [C]//International Conference on Supercomputing.New York:ACM,2011:214-224.
[19]BONDHUGULA U,BANDISHTI V,PANANILATH I.Dia-mond Tiling:Tiling Techniques to Maximize Parallelism for Stencil Computations [J].IEEE Transactions on Parallel and Distributed Systems,2017,28(5):1285-1298.
[20]WU Q,NI Y,HUANG X.Regional Ocean Model Parallel Optimization in “Sunway TaihuLight” [J].Journal of Computer Research and Development,2019,56(7):1556-1566.
[21]LV X J,LIU Z,CHU X S,et al. Extreme-scale SimulationBased LBM Computing Fluid Dynamics Simulations [J].Computer Science,2020,47(4):13-17.
[22]SAIDI S,TENDULKAR P,LEPLEY T,et al. Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms [J].Microprocessors Microsystems,2013,37(8):848-857.
[23]XU Z,LIN J,MATSUOKA S.Benchmarking sw26010 many-core processor [C]//2017 IEEE International Parallel and Distributed Processing Symposium Workshops(IPDPSW).Piscata-way:IEEE,2017:743-752.
[1] HE Ya-ru, PANG Jian-min, XU Jin-long, ZHU Yu, TAO Xiao-han. Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform [J]. Computer Science, 2021, 48(6): 34-40.
[2] LIU Xiao-nan, JING Li-na, WANG Li-xin, WANG Mei-ling. Large-scale Quantum Fourier Transform Simulation Based on SW26010 [J]. Computer Science, 2020, 47(8): 93-97.
[3] YUAN Xin-hui, LIN Rong-fen, WEI Di, YIN Wan-wang, XU Jin-xiu. Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010 [J]. Computer Science, 2020, 47(8): 98-104.
[4] LV Xiao-jing, LIU Zhao, CHU Xue-sen, SHI Shu-peng, MENG Hong-song, HUANG Zhen-chun. Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations [J]. Computer Science, 2020, 47(4): 13-17.
[5] NI Hong, LIU Xin. Many-core Optimization for Sparse Triangular Solver Under Unstructured Grids [J]. Computer Science, 2019, 46(6A): 518-522.
[6] TAO Xiao-han, PANG Jian-min, GAO Wei, WANG Qi, YAO Jin-yang. Performance Optimization of FT Program Based on SW26010 Processor [J]. Computer Science, 2019, 46(4): 321-328.
[7] YIN Meng-jia, XU Xian-bin, XIONG Zeng-gang and ZHANG Tao. Quantitative Performance Analysis Model of Matrix Multiplication Based on GPU [J]. Computer Science, 2015, 42(12): 13-17.
[8] JIN Gang,WANG Lei,WANG Zhi-ying. Performance Evaluation Method for Asynchronous Circuit Based on Static Data Flow Structure [J]. Computer Science, 2009, 36(12): 231-234.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!