计算机科学 ›› 2022, Vol. 49 ›› Issue (5): 363-370.doi: 10.11896/jsjkx.220100119

• 交叉与前沿 • 上一篇    下一篇

红黑Gauss-Seidel Stencil并行性和局部性优化

纪璎芮1, 袁良2, 张云泉2   

  1. 1 大连海洋大学信息工程学院 辽宁 大连 116023
    2 中国科学院计算技术研究所计算机体系结构国家重点实验室 北京100190
  • 收稿日期:2022-01-05 修回日期:2022-02-12 出版日期:2022-05-15 发布日期:2022-05-06
  • 通讯作者: 袁良(yuanliang@ict.ac.cn)
  • 作者简介:(jiyingrui1996@gmail.com)
  • 基金资助:
    国家自然科学基金(61972376,62072431,62032023)

Parallelization and Locality Optimization for Red-Black Gauss-Seidel Stencil

JI Ying-rui1, YUAN Liang2, ZHANG Yun-quan2   

  1. 1 School of Information Science and Engineering,Dalian Ocean University,Dalian,Liaoning 116023,China
    2 State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2022-01-05 Revised:2022-02-12 Online:2022-05-15 Published:2022-05-06
  • About author:JI Ying-rui,born in 1996,postgraduate.is a member of China Computer Federation.Her main research interests include parallel computational models,algorithms and program optimizations.
    YUAN Liang,born in 1984,Ph.D,associate professor,is a member of China Computer Federation.His main research interests include parallel computational models,algorithms and program optimizations.
  • Supported by:
    National Natural Science Foundation of China(61972376,62072431,62032023).

摘要: Stencil(模版计算)是一类常见的循环嵌套计算模式,被广泛应用于计算电磁、天气模拟、地球物理、海洋模拟等许多科学和工程模拟应用中。随着现代处理器体系结构的发展,多核和多层存储层次不断加深,研究并行性和局部性成为了提高程序运行速度的主要途径。分块是开发数据局部性和程序并行性的主要技术之一,目前,针对Stencil已提出了大量高效分块和向量化方法,但大多局限于具有较高并行度的Jacobi 类型的Stencil。Gauss-Seidel Stencil具有更优的收敛速度,被广泛应用于多重网格的计算中。这类Stencil的数据依赖更为复杂,文中面向红黑排序的Gauss-Seidel Stencil设计了一种并行分块和向量化算法,提升了Gauss-Seidel Stencil的数据局部性、中粒度多核并行性以及核内细粒度并行性。实验结果证实了本文方案的有效性。

关键词: Gauss-Seidel, Stencil计算, 分块

Abstract: Stencil is a common cyclic nested computing model,which is widely used in many scientific and engineering simulation applications,such as computational electromagnetism,weather simulation,geophysics,ocean simulation and so on.With the deve-lopment of modern processor architecture,the multi-core and multi-layer memory levels have been deepened.Research on paralle-lism and locality is the main way to improve the performance of programs.Blocking is one of the main techniques to exploit data locality and program parallelism.At present,a large number of blocking methods have been proposed for Stencil,but most of them are limited to Jacobi Stencils which is featured with high parallelism and locality.Gauss-Seidel Stencil has a better convergence rate and is widely used in multi-grid calculations.However,the data dependence of this type of Stencil is more complicated.In this paper,a parallel blocking and vectorization algorithm is designed for Gauss-Seidel Stencil for red black sorting,which improves the data locality,medium granularity multi-core parallelism and intra core fine-grained parallelism of Gauss-Seidel Stencil.Experimental results demonstrate the effectiveness of this scheme.

Key words: Blocking, Gauss-Seidel, Stencil computation

中图分类号: 

  • TP301
[1]WOLF M E,LAM M S.A data locality optimizing algorithm[C]//Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation.1991:30-44.
[2]SONG Y,LI Z.New tiling techniques to improve cache temporal locality[J].ACM SIGPLAN Notices,1999,34(5):215-228.
[3]RIVERA G,TSENG C W.Tiling optimizations for 3D scientific computations[C]//Proceedings of the 2000 ACM/IEEE Confe-rence on Supercomputing(SC’00).IEEE,2000:32.
[4]MENG J,SKADRON K.Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs[C]//Proceedings of the 23rd International Conference on Supercomputing.2009:256-265.
[5]RASTELLO F,DAUXOIS T.Efficient tiling for an ode discrete integration program:Redundant tasks instead of trapezoidal shaped-tiles[C]//Proceedings of the IEEE IPDPS’02.Pisca-taway,NJ:IEEE,2002:138.
[6]NGUYEN A,SATISH N,CHHUGANI J,et al.3.5-D blocking optimization for stencil computations on modern CPUs and GPUs[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Sto-rage and Analysis(SC’10).IEEE,2010:1-13.
[7]DING C,HE Y.A ghost cell expansion method for reducingcommunications in solving PDE problems[C]//Proceedings of the 2001 ACM/IEEE Conference on Supercomputing(SC’01).IEEE,2001:55-55.
[8]JIN G,MELLOR-CRUMMEY J,FOWLER R.Increasing temporal locality with skewing and recursive blocking[C]//Proceedings of the 2001 ACM/IEEE Conference on Supercomputing(SC’01).IEEE,2001:57-57.
[9]WONNACOTT D.Achieving scalable locality with time ske-wing[J].International Journal of Parallel Programming,2002,30(3):181-221.
[10]MALAS T,HAGER G,LTAIEF H,et al.Multicore-optimized wavefront diamond blocking for optimizing stencil updates[J].SIAM Journal on Scientific Computing,2015,37(4):C439-C464.
[11]WELLEIN G,HAGER G,ZEISER T,et al.Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization[C]//2009 33rd Annual IEEE International Computer Software and Applications Conference.IEEE,2009:579-586.
[12]BONDHUGULA U,HARTONO A,RAMANUJAM J,et al.A practical automatic polyhedral parallelizer and locality optimizer[C]//Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation.2008:101-113.
[13]BANDISHTI V,PANANILATH I,BONDHUGULA U.Tiling stencil computations to maximize parallelism[C]//Proceedings of the International Conference on High Performance Computing,Networking,Storage and Analysis(SC’12).IEEE,2012:1-11.
[14]PANANILATH I,ACHARYA A,VASISTA V,et al.An optimizing code generator for a class of lattice-boltzmann computations[J].ACM Transactions on Architecture and Code Optimization (TACO),2015,12(2):1-23.
[15]GROSSER T,VERDOOLAEGE S,COHEN A,et al.The relation between diamond tiling and hexagonal tiling[J].Parallel Processing Letters,2014,24(3):1441002.
[16]FRIGO M,STRUMPEN V.Cache oblivious stencil computa-tions[C]//Proceedings of the 19th Annual International Conference on Supercomputing.2005:361-366.
[17]FRIGO M,STRUMPEN V.The cache complexity of mul-tithreaded cache oblivious algorithms[J].Theory of Computing Systems,2009,45(2):203-233.
[18]STRZODKA R,SHAHEEN M,PAJAK D,et al.Cache obli-vious parallelograms in iterative stencil computations[C]//Proceedings of the 24th ACM International Conference on Supercomputing.2010:49-59.
[19]TANG Y,CHOWDHURY R A,KUSZMAUL B C,et al.The pochoir stencil compiler[C]//Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures.2011:117-128.
[20]KRISHNAMOORTHY S,BASKARAN M,BONDHUGULA U,et al.Effective automatic parallelization of stencil computations[J].ACM Sigplan Notices,2007,42(6):235-244.
[21]CUI H M,WANG L,FAN D R,et al.Landing stencil code on Godson-T[J].Journal of Computer Science and Technology,2010,25(4):886-894.
[22]WONNACOTT D.Using time skewing to eliminate idle timedue to memory bandwidth and network limitations[C]//Proceedings 14th International Parallel and Distributed Processing Symposium(IPDPS 2000).IEEE,2000:171-180.
[23]ZHOU X,GIACALONE J P,GARZARÁN M J,et al.Hierarchical overlapped tiling[C]//Proceedings of the Tenth International Symposium on Code Generation and Optimization.2012:207-218.
[24]GROSSER T,COHEN A,KELLY P H J,et al.Split tiling for GPUs:automatic parallelization using trapezoidal tiles[C]//Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units.2013:24-31.
[25]HENRETTY T,VERAS R,FRANCHETTI F,et al.A stencil compiler for short-vector simd architectures[C]//Proceedings of the 27th International ACM Conference on International Confe-rence on Supercomputing.2013:13-24.
[26]STRZODKA R,SHAHEEN M,PAJAK D,et al.Cache accurate time skewing in iterative stencil computations[C]//2011 International Conference on Parallel Processing.IEEE,2011:571-581.
[27]GROSSER T,COHEN A,HOLEWINSKI J,et al.Hybrid hexagonal/classical tiling for GPUs[C]//Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization.2014:66-75.
[28]YUAN L,ZHANG Y,GUO P,et al.Tessellating stencils[C]//Proceedings of the International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.2017:1-13.
[1] 左杰格, 柳晓鸣, 蔡兵.
基于图像分块与特征融合的户外图像天气识别
Outdoor Image Weather Recognition Based on Image Blocks and Feature Fusion
计算机科学, 2022, 49(3): 197-203. https://doi.org/10.11896/jsjkx.201200263
[2] 苏尔.
初等稳定矩阵约化A0为上Hessenberg型的方法研究
Research on Method of Reducing A0 to Upper Hessenberg Type with Elementary Stability Matrix
计算机科学, 2021, 48(6A): 649-657. https://doi.org/10.11896/jsjkx.200800063
[3] 朱雨, 庞建民, 徐金龙, 陶小涵, 王军.
面向SW26010处理器的三维Stencil自适应分块参数算法
Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor
计算机科学, 2021, 48(6): 10-18. https://doi.org/10.11896/jsjkx.200700059
[4] 钱心缘, 吴文渊.
基于R-SIS和R-LWE构建的IBE加密方案
Identity-based Encryption Scheme Based on R-SIS/R-LWE
计算机科学, 2021, 48(6): 315-323. https://doi.org/10.11896/jsjkx.200700215
[5] 赵会群, 吴凯锋.
一种大数据估价算法
Big Data Valuation Algorithm
计算机科学, 2020, 47(9): 110-116. https://doi.org/10.11896/jsjkx.191000156
[6] 池昊宇, 陈长波.
基于神经网络的循环分块大小预测
Prediction of Loop Tiling Size Based on Neural Network
计算机科学, 2020, 47(8): 62-70. https://doi.org/10.11896/jsjkx.191200180
[7] 喻露, 胡剑锋, 姚磊岳.
全局块与局部块协作的相关滤波目标跟踪算法
Correlation Filter Object Tracking Algorithm Based on Global and Local Block Cooperation
计算机科学, 2020, 47(6): 157-163. https://doi.org/10.11896/jsjkx.190500078
[8] 田伟, 刘浩, 陈根龙, 宫晓蕙.
面向分块压缩感知的交叉子集导引自适应观测
Cross Subset-guided Adaptive Measurement for Block Compressive Sensing
计算机科学, 2020, 47(12): 190-196. https://doi.org/10.11896/jsjkx.200800197
[9] 李春景, 胡静, 唐枝.
基于层次特征的自适应径向基插值图像放大的保真指标
Fidelity Index in Image Magnification Based on Hierarchical Feature and Radial Basis Function
计算机科学, 2019, 46(4): 254-260. https://doi.org/10.11896/j.issn.1002-137X.2019.04.040
[10] 陈莉莉, 朱峰, 盛斌, 陈志华.
基于离散四元数傅里叶变换的彩色图像质量评价
Quality Evaluation of Color Image Based on Discrete Quaternion Fourier Transform
计算机科学, 2018, 45(8): 70-74. https://doi.org/10.11896/j.issn.1002-137X.2018.08.012
[11] 王燕, 王双印.
基于卷积神经网络的人脸信息增强识别研究
Research on Face Information Enhancement and Recognition Based on Convolutional Neural Network
计算机科学, 2018, 45(8): 268-271. https://doi.org/10.11896/j.issn.1002-137X.2018.08.048
[12] 杜秀丽, 张薇, 顾斌斌, 陈波, 邱少明.
基于灰度共生矩阵的图像自适应分块压缩感知方法
GLCM-based Adaptive Block Compressed Sensing Method for Image
计算机科学, 2018, 45(8): 277-282. https://doi.org/10.11896/j.issn.1002-137X.2018.08.050
[13] 瞿中,郭阳,鞠芳蓉.
一种基于改进渗流模型的混凝土表面裂缝快速检测算法
Algorithm of Accelerated Cracks Detection Based on Improved Percolation Model in Concrete Surface Image
计算机科学, 2017, 44(1): 300-302. https://doi.org/10.11896/j.issn.1002-137X.2017.01.055
[14] 曹春红,张建华,李林峰.
基于双子代差分演化和自适应分块机制的多聚焦图像融合算法
Multi-focus Image Fusion Based on Twin-generation Differential Evolution and Adaptive Block Mechanism
计算机科学, 2016, 43(7): 67-72. https://doi.org/10.11896/j.issn.1002-137X.2016.07.011
[15] 王毅,金忠.
基于矩阵完整化的分块整合推荐算法
Split-Integration Recommendation Algorithm Based on Matrix Completion
计算机科学, 2016, 43(5): 219-222. https://doi.org/10.11896/j.issn.1002-137X.2016.05.040
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!