Computer Science ›› 2022, Vol. 49 ›› Issue (5): 363-370.doi: 10.11896/jsjkx.220100119

• Interdiscipline & Frontier • Previous Articles     Next Articles

Parallelization and Locality Optimization for Red-Black Gauss-Seidel Stencil

JI Ying-rui1, YUAN Liang2, ZHANG Yun-quan2   

  1. 1 School of Information Science and Engineering,Dalian Ocean University,Dalian,Liaoning 116023,China
    2 State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2022-01-05 Revised:2022-02-12 Online:2022-05-15 Published:2022-05-06
  • About author:JI Ying-rui,born in 1996,postgraduate.is a member of China Computer Federation.Her main research interests include parallel computational models,algorithms and program optimizations.
    YUAN Liang,born in 1984,Ph.D,associate professor,is a member of China Computer Federation.His main research interests include parallel computational models,algorithms and program optimizations.
  • Supported by:
    National Natural Science Foundation of China(61972376,62072431,62032023).

Abstract: Stencil is a common cyclic nested computing model,which is widely used in many scientific and engineering simulation applications,such as computational electromagnetism,weather simulation,geophysics,ocean simulation and so on.With the deve-lopment of modern processor architecture,the multi-core and multi-layer memory levels have been deepened.Research on paralle-lism and locality is the main way to improve the performance of programs.Blocking is one of the main techniques to exploit data locality and program parallelism.At present,a large number of blocking methods have been proposed for Stencil,but most of them are limited to Jacobi Stencils which is featured with high parallelism and locality.Gauss-Seidel Stencil has a better convergence rate and is widely used in multi-grid calculations.However,the data dependence of this type of Stencil is more complicated.In this paper,a parallel blocking and vectorization algorithm is designed for Gauss-Seidel Stencil for red black sorting,which improves the data locality,medium granularity multi-core parallelism and intra core fine-grained parallelism of Gauss-Seidel Stencil.Experimental results demonstrate the effectiveness of this scheme.

Key words: Blocking, Gauss-Seidel, Stencil computation

CLC Number: 

  • TP301
[1]WOLF M E,LAM M S.A data locality optimizing algorithm[C]//Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation.1991:30-44.
[2]SONG Y,LI Z.New tiling techniques to improve cache temporal locality[J].ACM SIGPLAN Notices,1999,34(5):215-228.
[3]RIVERA G,TSENG C W.Tiling optimizations for 3D scientific computations[C]//Proceedings of the 2000 ACM/IEEE Confe-rence on Supercomputing(SC’00).IEEE,2000:32.
[4]MENG J,SKADRON K.Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs[C]//Proceedings of the 23rd International Conference on Supercomputing.2009:256-265.
[5]RASTELLO F,DAUXOIS T.Efficient tiling for an ode discrete integration program:Redundant tasks instead of trapezoidal shaped-tiles[C]//Proceedings of the IEEE IPDPS’02.Pisca-taway,NJ:IEEE,2002:138.
[6]NGUYEN A,SATISH N,CHHUGANI J,et al.3.5-D blocking optimization for stencil computations on modern CPUs and GPUs[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Sto-rage and Analysis(SC’10).IEEE,2010:1-13.
[7]DING C,HE Y.A ghost cell expansion method for reducingcommunications in solving PDE problems[C]//Proceedings of the 2001 ACM/IEEE Conference on Supercomputing(SC’01).IEEE,2001:55-55.
[8]JIN G,MELLOR-CRUMMEY J,FOWLER R.Increasing temporal locality with skewing and recursive blocking[C]//Proceedings of the 2001 ACM/IEEE Conference on Supercomputing(SC’01).IEEE,2001:57-57.
[9]WONNACOTT D.Achieving scalable locality with time ske-wing[J].International Journal of Parallel Programming,2002,30(3):181-221.
[10]MALAS T,HAGER G,LTAIEF H,et al.Multicore-optimized wavefront diamond blocking for optimizing stencil updates[J].SIAM Journal on Scientific Computing,2015,37(4):C439-C464.
[11]WELLEIN G,HAGER G,ZEISER T,et al.Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization[C]//2009 33rd Annual IEEE International Computer Software and Applications Conference.IEEE,2009:579-586.
[12]BONDHUGULA U,HARTONO A,RAMANUJAM J,et al.A practical automatic polyhedral parallelizer and locality optimizer[C]//Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation.2008:101-113.
[13]BANDISHTI V,PANANILATH I,BONDHUGULA U.Tiling stencil computations to maximize parallelism[C]//Proceedings of the International Conference on High Performance Computing,Networking,Storage and Analysis(SC’12).IEEE,2012:1-11.
[14]PANANILATH I,ACHARYA A,VASISTA V,et al.An optimizing code generator for a class of lattice-boltzmann computations[J].ACM Transactions on Architecture and Code Optimization (TACO),2015,12(2):1-23.
[15]GROSSER T,VERDOOLAEGE S,COHEN A,et al.The relation between diamond tiling and hexagonal tiling[J].Parallel Processing Letters,2014,24(3):1441002.
[16]FRIGO M,STRUMPEN V.Cache oblivious stencil computa-tions[C]//Proceedings of the 19th Annual International Conference on Supercomputing.2005:361-366.
[17]FRIGO M,STRUMPEN V.The cache complexity of mul-tithreaded cache oblivious algorithms[J].Theory of Computing Systems,2009,45(2):203-233.
[18]STRZODKA R,SHAHEEN M,PAJAK D,et al.Cache obli-vious parallelograms in iterative stencil computations[C]//Proceedings of the 24th ACM International Conference on Supercomputing.2010:49-59.
[19]TANG Y,CHOWDHURY R A,KUSZMAUL B C,et al.The pochoir stencil compiler[C]//Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures.2011:117-128.
[20]KRISHNAMOORTHY S,BASKARAN M,BONDHUGULA U,et al.Effective automatic parallelization of stencil computations[J].ACM Sigplan Notices,2007,42(6):235-244.
[21]CUI H M,WANG L,FAN D R,et al.Landing stencil code on Godson-T[J].Journal of Computer Science and Technology,2010,25(4):886-894.
[22]WONNACOTT D.Using time skewing to eliminate idle timedue to memory bandwidth and network limitations[C]//Proceedings 14th International Parallel and Distributed Processing Symposium(IPDPS 2000).IEEE,2000:171-180.
[23]ZHOU X,GIACALONE J P,GARZARÁN M J,et al.Hierarchical overlapped tiling[C]//Proceedings of the Tenth International Symposium on Code Generation and Optimization.2012:207-218.
[24]GROSSER T,COHEN A,KELLY P H J,et al.Split tiling for GPUs:automatic parallelization using trapezoidal tiles[C]//Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units.2013:24-31.
[25]HENRETTY T,VERAS R,FRANCHETTI F,et al.A stencil compiler for short-vector simd architectures[C]//Proceedings of the 27th International ACM Conference on International Confe-rence on Supercomputing.2013:13-24.
[26]STRZODKA R,SHAHEEN M,PAJAK D,et al.Cache accurate time skewing in iterative stencil computations[C]//2011 International Conference on Parallel Processing.IEEE,2011:571-581.
[27]GROSSER T,COHEN A,HOLEWINSKI J,et al.Hybrid hexagonal/classical tiling for GPUs[C]//Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization.2014:66-75.
[28]YUAN L,ZHANG Y,GUO P,et al.Tessellating stencils[C]//Proceedings of the International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.2017:1-13.
[1] TIAN Bing-chuan, TIAN Chen, ZHOU Yu-hang, CHEN Gui-hai, DOU Wan-chun. Reducing Head-of-Line Blocking on Network in Hadoop Clusters [J]. Computer Science, 2022, 49(3): 11-22.
[2] ZHAO Hui-qun, WU Kai-feng. Big Data Valuation Algorithm [J]. Computer Science, 2020, 47(9): 110-116.
[3] TAO Yang,JI Rui-juan,YANG Li,WANG Jin. Study on Dynamic Priority Admission Control Algorithm in Heterogeneous Wireless Networks [J]. Computer Science, 2020, 47(3): 242-247.
[4] CHEN Dai-bin and YANG Xiao-mei. Block-coded Video Deblocking Based on Low-rank Tensor Recovery [J]. Computer Science, 2016, 43(9): 280-283.
[5] HU Bo and TAN Liang. Optimization of Communication Performance of RPC Client in HBase Architecture [J]. Computer Science, 2016, 43(4): 97-101.
[6] WANG Yue, ZHOU Cheng, XIONG Cheng-yi and SHU Zhen-yu. Enhanced Block Compressed Sensing of Images Based on Total Variation Using Texture Information [J]. Computer Science, 2016, 43(2): 307-310.
[7] PANG Ren-bo, ZHANG Yun-quan, TAN Guang-ming, XU Jian-liang, JIA Hai-peng and XIE Qing-chun. Parallelization of Hydrostatic Numerical Forecasting Model of Marginal Sea [J]. Computer Science, 2016, 43(1): 14-17.
[8] FAN Wen-chao, LI Xiao-yu, WEI Kai and CHEN Xing-lin. Moving Target Detection Based on Improved Gaussian Mixture Model [J]. Computer Science, 2015, 42(5): 286-288.
[9] WU Gui-ming, WANG Miao, XIE Xiang-hui, DOU Yong and GUO Song. Sparse Matrix Blocking Method for Custom Architecture [J]. Computer Science, 2015, 42(11): 63-64.
[10] TAN Ming-chao,DIAO Xing-chun and CAO Jian-jun. Survey on Entity Resolution [J]. Computer Science, 2014, 41(4): 9-12.
[11] TIAN Hong-jin and ZHAN Yin-wei. Moving Object Detection Based on Adaptive Image Blocking and SSIM [J]. Computer Science, 2014, 41(2): 119-122.
[12] ZHANG Qi-liang and CHEN Yong-sheng. Bi-directional Blocking Job Shop Model and Particle Swarm Optimization Algorithm for Train Scheduling Problem on Single-track Lines [J]. Computer Science, 2013, 40(12): 276-281.
[13] . Communication Optimization Algorithm Using Reordering Transformation and Loop Distribution [J]. Computer Science, 2012, 39(9): 296-301.
[14] . Pipeline Hardware Design and Optimization for H.264 Deblocking Filter [J]. Computer Science, 2011, 38(12): 288-292.
[15] WANG Xin,LU Zhi-bo. Locate Tampered Image Region Automatically Based on Inconsistency of ,JPEG Blocking Artifacts [J]. Computer Science, 2010, 37(2): 269-273.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!