Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture

Computer Science ›› 2022, Vol. 49 ›› Issue (6): 73-80. doi: 10.11896/jsjkx.210900045

YE Yue-jin¹, LI Fang¹, CHEN De-xun², GUO Heng², CHEN Xin¹

1 National Supercomputing Center in Wuxi, Wuxi, Jiangsu 214000, China
2 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Received: 2021-09-05 Revised: 2022-02-21 Online: 2022-06-15 Published: 2022-06-08
Corresponding author: LI Fang (lifang56@163.net)
About author: (ye_ddr@foxmail.com)
YE Yue-jin, born in 1991, master, engineer, is a member of China Computer Federation. His main research interests include high performance computing.
LI Fang, born in 1980, postgraduate, Ph.D, associate professor. Her main research interests include high performance computing.
Supported by:
National Key Research and Development Program of China, “High Performance Computing” Key Special Project (2020YFB0204804, 2016YFB0201100).

Abstract

Abstract: Efficiently handling the discrete memory accesses of unstructured grids has long been one of the central issues for parallel algorithms and applications in scientific and engineering computing. A distributed block reconnection optimization algorithm, designed for the domestic Sunway heterogeneous many-core architecture, maintains high computing performance when solving the unstructured sparse problems that arise in applications. Based on an in-depth analysis of the on-chip communication mechanism of the many-core architecture, an efficient message grouping strategy is designed to improve the bandwidth utilization of the on-chip slave-core array. It is combined with a barrier-free data distribution algorithm to make full use of the network performance of the domestic heterogeneous many-core architecture. Performance modeling and experimental analysis show that the average memory bandwidth of the proposed algorithm reaches more than 70% of the theoretical value under different memory access patterns. Compared with the serial algorithm on the master core, it achieves an average speedup of 10 times and a maximum speedup of 45 times. Tests on applications from several different fields also demonstrate the general applicability of the algorithm.

Key words: Barrier-free data distribution, Domestic many-core architecture, Message grouping, On-chip communication, Unstructured grid
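
The message grouping idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the contiguous partition (`block_size` elements per owner) and the helper names `group_messages` and `gather_grouped` are assumptions made purely for illustration. The sketch groups scattered element requests by the block that owns them, so that each block is read once as a contiguous chunk rather than once per element.

```python
from collections import defaultdict


def group_messages(indices, block_size):
    """Group scattered element indices by the block (owner) holding them.

    Assumption: elements are partitioned into contiguous blocks of
    `block_size`; this layout is hypothetical, not taken from the paper.
    """
    groups = defaultdict(list)
    for pos, idx in enumerate(indices):
        owner = idx // block_size         # block that holds this element
        groups[owner].append((pos, idx))  # remember where the result goes
    return dict(groups)


def gather_grouped(data, indices, block_size):
    """Gather data[indices] with one contiguous read per owner block."""
    out = [None] * len(indices)
    grouped = group_messages(indices, block_size)
    for owner, requests in sorted(grouped.items()):
        lo = owner * block_size
        block = data[lo:lo + block_size]  # single contiguous read
        for pos, idx in requests:
            out[pos] = block[idx - lo]    # serve every request for this block
    return out


data = [x * 10 for x in range(16)]
indices = [13, 2, 7, 3, 12, 6]
result = gather_grouped(data, indices, block_size=4)
```

Here six scattered requests touch only three blocks, so the gather issues three contiguous reads instead of six isolated ones. On real hardware the analogous gain would depend on the on-chip communication and memory access characteristics that the paper analyzes.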

CLC Number: TP311

YE Yue-jin, LI Fang, CHEN De-xun, GUO Heng, CHEN Xin. Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture[J]. Computer Science, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045



Related articles


[1] CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-zhe, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang. Parallel Optimization Method of Unstructured-grid Computing in CFD for Domestic Heterogeneous Many-core Architecture[J]. Computer Science, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[2] NI Hong, LIU Xin. Many-core Optimization for Sparse Triangular Solver Under Unstructured Grids[J]. Computer Science, 2019, 46(6A): 518-522.
[3] LIU Xin, LU Lin-sheng, CHEN De-xun. Research on Pre-processing Methods of Unstructured Grids[J]. Computer Science, 2012, 39(3): 308-311.