Computer Science, 2021, Vol. 48, Issue (12): 43-48. doi: 10.11896/jsjkx.201200129

• Computer Architecture

DGX-2 Based Optimization of Application for Turbulent Combustion

WEN Min-hua1, WANG Shen-peng1, WEI Jian-wen1, LI Lin-ying2, ZHANG Bin2, LIN Xin-hua1

  1 Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
  2 School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2020-12-14 Revised: 2021-04-09 Online: 2021-12-15 Published: 2021-11-26
  • Corresponding author: LIN Xin-hua (james@sjtu.edu.cn)
  • About author: WEN Min-hua, born in 1988, associate engineer, is a member of China Computer Federation. His main research interests include engineering computing.
    LIN Xin-hua, born in 1979, senior engineer, is a member of China Computer Federation. His main research interests include performance modeling and optimization.
  • Supported by:
    National Key Research and Development Program of China (2016YFB0201800).

Abstract: Numerical simulation of turbulent combustion is a key tool in aero-engine design. Solving the Navier-Stokes (NS) equations with high-accuracy models requires a huge amount of computation, and the physicochemical models make the flow field extremely complex, so load balancing within the computational domain becomes a bottleneck for large-scale parallel computing. This paper therefore ports and optimizes a numerical simulation method for turbulent combustion on a single powerful server, the DGX-2. We design a thread assignment scheme for the flux calculation and use the Roofline model to analyze and guide the optimization. We also design an efficient data communication scheme and, using the high-speed interconnect of the DGX-2, implement a multi-GPU parallel version of the method. Experiments show that, compared with the parallel version on a dual-socket Intel Xeon 6248 CPU node with 40 cores, the computational part of the iteration runs 8.1x faster on a single V100 GPU and 66.1x faster on the 16 V100 GPUs of the DGX-2, which exceeds the best performance the CPU parallel version can reach.

Key words: CUDA, DGX-2, Navier-Stokes equation, Turbulent combustion
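
The speedup figures in the abstract can be related to each other with a short calculation. The sketch below is illustrative only: it assumes both speedups are measured against the same 40-core CPU baseline on the same workload, and the function names are hypothetical. It also shows the basic Roofline bound, min(peak compute, arithmetic intensity × memory bandwidth); the V100 peak values and the 0.5 FLOP/byte intensity are assumed nominal figures, not numbers taken from the paper.

```python
def speedup_summary(single_gpu_speedup, multi_gpu_speedup, n_gpus):
    """Return (scaling, efficiency) of the multi-GPU run relative to one GPU."""
    scaling = multi_gpu_speedup / single_gpu_speedup
    efficiency = scaling / n_gpus
    return scaling, efficiency


def roofline_bound(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s under the Roofline model: min(compute roof, memory roof)."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


# Figures quoted in the abstract: 8.1x on one V100 and 66.1x on 16 V100s,
# both relative to the 40-core dual-socket Xeon 6248 baseline.
scaling, efficiency = speedup_summary(8.1, 66.1, 16)
print(f"1 -> 16 GPU scaling: {scaling:.2f}x, parallel efficiency: {efficiency:.1%}")

# ASSUMPTION: nominal V100 figures (about 7.8 TFLOP/s FP64, 900 GB/s HBM2) and a
# hypothetical kernel at 0.5 FLOP/byte; none of these values come from the paper.
bound = roofline_bound(0.5, 7.8e12, 900e9)
print(f"Roofline bound at 0.5 FLOP/byte: {bound / 1e12:.2f} TFLOP/s")
```

Under these assumptions, going from 1 to 16 GPUs gives roughly an 8.2x gain, that is, a parallel efficiency of about 51%. A kernel at the assumed intensity would sit on the memory-bandwidth roof rather than the compute roof.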

CLC number: TP311.1