Computer Science, 2021, Vol. 48, Issue (12): 43-48. doi: 10.11896/jsjkx.201200129

• Computer Architecture

DGX-2 Based Optimization of Application for Turbulent Combustion

WEN Min-hua1, WANG Shen-peng1, WEI Jian-wen1, LI Lin-ying2, ZHANG Bin2, LIN Xin-hua1

  1 Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
  2 School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2020-12-14 Revised: 2021-04-09 Online: 2021-12-15 Published: 2021-11-26
  • Corresponding author: LIN Xin-hua (james@sjtu.edu.cn)
  • About author: WEN Min-hua, born in 1988, associate engineer, is a member of China Computer Federation. His main research interests include engineering computing.
    LIN Xin-hua, born in 1979, senior engineer, is a member of China Computer Federation. His main research interests include performance modeling and optimization.
  • Supported by:
    National Key Research and Development Program of China (2016YFB0201800).

Abstract: Numerical simulation of turbulent combustion is a key tool for aeroengine design. Because the Navier-Stokes (NS) equations must be solved with high-accuracy models, such simulations are computationally expensive, and the physicochemical models they introduce make the flow field extremely complex, so load balancing across the computational domain becomes a bottleneck for large-scale parallel computing. We therefore port and optimize a numerical simulation method for turbulent combustion on DGX-2, a single server with very high computing power. We design a thread-assignment scheme for the flux calculation and use the Roofline model to analyze and guide the optimization. In addition, we design an efficient data communication scheme and, using the high-speed interconnect of DGX-2, implement a multi-GPU parallel version of the method. Experimental results show that, relative to the parallel version on a dual-socket Intel Xeon 6248 CPU node with 40 cores, the computation part of the iteration achieves an 8.1x speedup on a single V100 GPU and a 66.1x speedup on the 16 V100 GPUs of DGX-2, exceeding the best performance attainable by the parallel CPU version.

Key words: Turbulent combustion, Navier-Stokes equation, DGX-2, CUDA
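
As a rough reading of the two speedups quoted in the abstract, and assuming both are measured against the same dual-socket CPU baseline, the short sketch below derives the 16-GPU speedup relative to a single GPU and the corresponding parallel efficiency. The function and variable names are illustrative and do not come from the paper.

```python
# Minimal sketch: multi-GPU scaling implied by the abstract's speedups.
# Assumption: both speedups use the same CPU baseline (dual-socket Xeon 6248).

def relative_speedup(multi_gpu_speedup: float, single_gpu_speedup: float) -> float:
    """Speedup of the multi-GPU run over the single-GPU run."""
    return multi_gpu_speedup / single_gpu_speedup


def parallel_efficiency(multi_gpu_speedup: float,
                        single_gpu_speedup: float,
                        num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved across num_gpus GPUs."""
    return relative_speedup(multi_gpu_speedup, single_gpu_speedup) / num_gpus


SINGLE_V100_SPEEDUP = 8.1   # single V100 vs. CPU baseline (from the abstract)
DGX2_SPEEDUP = 66.1         # 16 V100s on DGX-2 vs. CPU baseline (from the abstract)
NUM_GPUS = 16

scaling = relative_speedup(DGX2_SPEEDUP, SINGLE_V100_SPEEDUP)
efficiency = parallel_efficiency(DGX2_SPEEDUP, SINGLE_V100_SPEEDUP, NUM_GPUS)

print(f"16-GPU vs 1-GPU speedup: {scaling:.2f}x")
print(f"Parallel efficiency: {efficiency:.1%}")
```

Under that assumption the 16-GPU run is about 8.2 times faster than the single-GPU run, which corresponds to roughly 51% of ideal linear scaling.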

CLC Number: 

  • TP311.1