Computer Science, 2017, Vol. 44, Issue (10): 64-70. doi: 10.11896/j.issn.1002-137X.2017.10.012

• 2016 National Annual Conference on High Performance Computing •

Porting and Optimizing OpenFOAM on Sunway TaihuLight System

MENG De-long, WEN Min-hua, WEI Jian-wen and James LIN

  1. High Performance Computing Center, Shanghai Jiao Tong University, Shanghai 200240; Tokyo Institute of Technology, Tokyo 152-8550
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Key Research and Development Program of China (2016YFB0201400, 2016YFB0201800), the RONPAKU program of the Japan Society for the Promotion of Science (JSPS), and the Research Center of Parallel Computer Engineering and Technology.

Abstract: Sunway TaihuLight ranks first on the latest Top500 list, with a peak performance of 125.4 PFlops delivered mainly by the Chinese-designed SW26010 many-core processor. OpenFOAM (Open Source Field Operation and Manipulation) is one of the most widely used open-source software packages in computational fluid dynamics (CFD). Because it is written in C++ and is not fully compatible with the compilers for the heterogeneous many-core SW26010, it cannot run efficiently on this architecture without modification. This paper ports the core computational code of OpenFOAM to the SW26010's MPE (management processing element)/CPE (computing processing element) cluster architecture and adopts a mixed-language design to overcome the compilation incompatibility. Several SW26010-specific optimizations, including register communication, vectorization, and double buffering, are applied to the OpenFOAM hotspot. Experiments on SW26010 with real datasets show that the single-CG (core group) code runs 8.03x faster than the tuned MPE-only version and 1.18x faster than a serial implementation on an Intel(R) Xeon(R) CPU E5-2695 v3. The single-CG implementation is also extended to run at scale on Sunway TaihuLight; in a strong-scaling test it achieves a speedup of 184.9x on 256 CGs. The porting approach and optimizations can serve as a reference for bringing other complex C++ programs to Sunway TaihuLight.

Key words: CFD, OpenFOAM, heterogeneous many-core processor, Sunway supercomputer
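
The strong-scaling result quoted in the abstract can be turned into a parallel efficiency with one division. The sketch below is illustrative only: it assumes the 184.9x speedup is measured against a single CG (the abstract does not state the baseline), and the function name `parallel_efficiency` is hypothetical, not taken from the paper.

```python
def parallel_efficiency(speedup: float, units: int) -> float:
    """Strong-scaling efficiency: measured speedup divided by the ideal speedup.

    Assumption: the speedup is relative to one processing unit (here, one CG),
    so the ideal speedup equals the number of units.
    """
    if units <= 0:
        raise ValueError("units must be a positive integer")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup / units


# Figures taken from the abstract: 184.9x speedup on 256 CGs.
efficiency = parallel_efficiency(184.9, 256)
print(f"parallel efficiency on 256 CGs: {efficiency:.1%}")
```

Under that assumption the efficiency comes out at roughly 72%, meaning about 28% of the ideal 256x speedup is lost to communication and other overheads.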
