Computer Science ›› 2020, Vol. 47 ›› Issue (8): 32-40. doi: 10.11896/jsjkx.200500093

Special topic: High Performance Computing

• High Performance Computing •

Parallelizing Multigrid Applications with a Data-driven Programming Model

GUO Jie (郭杰)1, GAO Xi-ran (高希然)2, CHEN Li (陈莉)2, FU You (傅游)1, LIU Ying (刘颖)2

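  1 College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590
  2 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
  • Publication date: 2020-08-15  Release date: 2020-08-10
  • Corresponding author: CHEN Li (lchen@ict.ac.cn)
  • About author: 17854258663@163.com
  • Funding:
    National Natural Science Foundation of China (61521092); National Key R&D Program of China (2016YFB0200803); Key R&D Project of Shandong Province (2019GGX101066)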

Parallelizing Multigrid Application Using Data-driven Programming Model

GUO Jie1, GAO Xi-ran2, CHEN Li2, FU You1, LIU Ying2

  1 College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
  2 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Online: 2020-08-15  Published: 2020-08-10
  • About author: GUO Jie, born in 1996, postgraduate. His main research interests include parallel optimization and parallel compilation.
    CHEN Li, born in 1970, Ph.D., associate professor, is a member of the China Computer Federation. Her main research interests include parallel programming languages and parallelizing compilation techniques.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61521092), the National Key R&D Program of China (2016YFB0200803) and the Key R&D Project of Shandong Province (2019GGX101066).

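Abstract: Multigrid is an important and widely used technique in numerical computing for accelerating the convergence of iterative methods. In recent years, large-scale parallel computing systems have evolved toward multi-core and heterogeneous many-core architectures, and multigrid applications urgently need to adapt to these new parallel platforms. This paper uses AceMesh, a data-driven task-parallel language, to port the legacy NAS MG program to two domestic supercomputing platforms with different architectures, Tianhe-2 and Sunway TaihuLight. It shows how computation loops and communication code are expressed as parallel tasks in this language, and verifies the cross-platform performance portability of AceMesh. The paper qualitatively analyzes the application's task-graph characteristics and its computation-communication overlap, compares its performance on the two platforms with the existing MPI/OpenMP and MPI/OpenACC programming models respectively, and analyzes how the AceMesh task-graph parallel program improves memory-access performance and communication-computation overlap. Experimental data show that, compared with traditional parallel programming methods, AceMesh achieves speedups of up to 1.19X on Sunway TaihuLight and up to 1.85X on Tianhe-2. Finally, future research directions are proposed to address the application's communication characteristics at different grid levels and the problem that communication serialization leaves a large share of communication unhidden.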

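Key words: MPI legacy application, Multigrid, Computation-communication overlap, Data-driven task-parallel programming model, Heterogeneous many-core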

Abstract: Multigrid is an important family of algorithms for accelerating the convergence of iterative solvers for linear systems, and it plays an important role in large-scale scientific computing. Distributed-memory systems have evolved into large-scale systems built from multi-core nodes or heterogeneous nodes with accelerators, so legacy applications urgently need to be ported to modern supercomputers with diverse node-level architectures. This paper introduces AceMesh, a data-driven, directive-based programming language, and uses it to port NAS MG to two domestically developed supercomputers, Tianhe-2 and Sunway TaihuLight. It shows how to taskify computation loops and communication-related code in AceMesh, and analyzes the characteristics of the resulting task graph and of its computation-communication overlap. Experimental results show that, compared with traditional programming models, the AceMesh versions achieve speedups of up to 1.19X on Sunway TaihuLight and up to 1.85X on Tianhe-2. Analysis attributes the performance improvements to two main factors: memory-related optimization and communication-overlap optimization. Finally, future directions are proposed for further optimizing inter-process communication in the AceMesh version.

Key words: Computation-communication overlap, Data-driven task parallel programming model, Heterogeneous many-core, MPI legacy application, Multigrid
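
The abstract describes a data-driven task model in which computation loops and communication code become tasks, and overlap follows from the task graph. The sketch below illustrates that general idea only; it is not AceMesh syntax or its runtime. Every task name and data name (`pack_halo`, `exchange_halo`, `u_ghost`, and so on) is a hypothetical placeholder, and the dependency rules are the usual read-after-write, write-after-read and write-after-write ordering assumed for illustration.

```python
from collections import defaultdict


def build_task_graph(tasks):
    """Derive predecessor sets from declared data accesses.

    tasks: list of (name, reads, writes) tuples in program order.
    Returns {task_name: set of task names that must finish first}.
    """
    last_writer = {}              # data item -> most recent task that wrote it
    readers = defaultdict(list)   # data item -> tasks that read it since that write
    preds = {}
    for name, reads, writes in tasks:
        deps = set()
        for d in reads:           # read-after-write
            if d in last_writer:
                deps.add(last_writer[d])
        for d in writes:
            if d in last_writer:  # write-after-write
                deps.add(last_writer[d])
            deps.update(readers[d])  # write-after-read
        preds[name] = deps
        for d in reads:
            readers[d].append(name)
        for d in writes:
            last_writer[d] = name
            readers[d] = []
    return preds


def ancestors(preds, task):
    """All tasks that must complete, directly or transitively, before `task`."""
    seen, stack = set(), list(preds[task])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(preds[t])
    return seen


def can_overlap(preds, a, b):
    """True when neither task depends on the other, so they may run concurrently."""
    return a not in ancestors(preds, b) and b not in ancestors(preds, a)


# Hypothetical tasks for one smoothing step on one grid level.
tasks = [
    ("pack_halo",       {"u_boundary"},               {"send_buf"}),
    ("exchange_halo",   {"send_buf"},                 {"u_ghost"}),     # communication
    ("smooth_interior", {"u_interior"},               {"r_interior"}),
    ("smooth_boundary", {"u_boundary", "u_ghost"},    {"r_boundary"}),
    ("norm",            {"r_interior", "r_boundary"}, {"rnorm"}),
]
preds = build_task_graph(tasks)
overlap_interior = can_overlap(preds, "exchange_halo", "smooth_interior")
overlap_boundary = can_overlap(preds, "exchange_halo", "smooth_boundary")
```

In this toy graph the halo exchange and the interior smoothing share no data, so they may run concurrently, while the boundary smoothing must wait for the ghost cells. This is the kind of computation-communication overlap the abstract refers to, under the assumptions stated above.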

CLC Number: 

  • TP311