Computer Science, 2024, Vol. 51, Issue (4): 56-66. doi: 10.11896/jsjkx.231000124

• High Performance Computing •

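Performance Optimization of Complex Stencil in Weather Forecast Model WRF

DI Jianqiang1,2, YUAN Liang1, ZHANG Yunquan1, ZHANG Sijia2

  1. 1 High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
    2 School of Information Science and Engineering, Dalian Ocean University, Dalian, Liaoning 116023, China
  • Received: 2023-10-18 Revised: 2024-02-04 Online: 2024-04-15 Published: 2024-04-10
  • Corresponding author: YUAN Liang (yuanliang@ict.ac.cn)
  • About the author: (97239401@qq.com)
  • Supported by:
    National Natural Science Foundation of China (61972376, 62072431, 62032023) and Huawei (TC20220914048).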


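Abstract: The Weather Research and Forecasting (WRF) model is a widely used mesoscale numerical weather prediction system that plays an important role in atmospheric research and operational forecasting. Stencil computation is a common nested-loop pattern in scientific and engineering applications. In WRF, the numerical solution of the atmospheric dynamics and thermodynamics equations gives rise to a large number of complex stencil computations on spatial grids, characterized by multiple dimensions, multiple variables, special handling of physical model boundaries, and complex physical and dynamical processes. This paper analyzes the typical stencil patterns in WRF, identifies and abstracts the concept of an "intermediate variable" in typical stencil loops, and designs and implements three optimization schemes around it: intermediate variable computation merging, intermediate variable reduced-dimension storage, and intermediate variable extraction. These schemes improve data locality, increase data reuse and space reuse, and reduce redundant computation and memory access overhead. The results show that typical stencil hotspot functions of WRF 4.2, refactored with these schemes, achieve good speedups on both Intel and Hygon CPUs, with maximum speedups of 21.3% and 17.8%, respectively.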

Key words: WRF, Stencil computation, Intermediate variable, Optimization scheme, Data locality, Hotspot function, Performance improvement
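The abstract names the optimization schemes without showing code, so the following is only a toy illustration of what merging the computation of an intermediate variable and storing it in fewer dimensions might look like. It is a sketch under stated assumptions, not the authors' WRF implementation: the face-averaged `flux`, the function names `tendency_baseline` and `tendency_fused`, and the 2D grid layout are all hypothetical. The baseline materializes the intermediate variable as a full 2D array in one loop nest and consumes it in a second; the fused version computes it inside the consuming loop and keeps only two scalars per row.

```python
def tendency_baseline(u):
    """Two loop nests; the intermediate 'flux' is stored as a full 2D array."""
    nk, ni = len(u), len(u[0])
    # Hypothetical intermediate variable: average of neighbouring cells at face i.
    flux = [[0.0] * ni for _ in range(nk)]
    for k in range(nk):
        for i in range(1, ni):
            flux[k][i] = 0.5 * (u[k][i - 1] + u[k][i])
    # The second sweep reads the intermediate array back from memory.
    tend = [[0.0] * ni for _ in range(nk)]
    for k in range(nk):
        for i in range(1, ni - 1):
            tend[k][i] = flux[k][i + 1] - flux[k][i]
    return tend


def tendency_fused(u):
    """One loop nest; the intermediate is reduced from a 2D array to two scalars."""
    nk, ni = len(u), len(u[0])
    tend = [[0.0] * ni for _ in range(nk)]
    if ni < 3:
        return tend  # no interior points to update
    for k in range(nk):
        row = u[k]
        left = 0.5 * (row[0] + row[1])  # flux at face 1
        for i in range(1, ni - 1):
            right = 0.5 * (row[i] + row[i + 1])  # flux at face i + 1
            tend[k][i] = right - left
            left = right  # reuse the value instead of recomputing or reloading it
    return tend


if __name__ == "__main__":
    grid = [[1.0, 2.0, 4.0, 7.0], [0.0, 1.0, 0.0, 1.0]]
    print(tendency_baseline(grid))
    print(tendency_fused(grid))
```

Both functions evaluate each flux with the same expression in the same order, so they return identical values. The fused form simply avoids writing and re-reading the 2D intermediate array, which is the kind of locality gain the abstract describes.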


CLC Number:

  • TP319