Computer Science ›› 2024, Vol. 51 ›› Issue (12): 137-146. doi: 10.11896/jsjkx.231100135

• High Performance Computing •


Efficient Task Flow Parallel System for New Generation Sunway Processor

FU You1, DU Leiming1, GAO Xiran2, CHEN Li2   

  1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
    2. State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing 100190, China
  • Received:2023-11-20 Revised:2024-04-03 Online:2024-12-15 Published:2024-12-10
  • Corresponding author:CHEN Li(lchen@ict.ac.cn)
  • About author:FU You(fuyou@sdust.edu.cn),born in 1968,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.26368M).Her main research interests include high performance computing,distributed computing and intelligent computing.
    CHEN Li,born in 1970,Ph.D,associate professor,is a member of CCF(No.05815M).Her main research interests include parallel programming languages and parallelizing compiling techniques.
  • Supported by:
    Natural Science Foundation of Shandong Province(ZR2022MF274, ZR2021LZH004) and National Key Research and Development Program of China(2017YFB0202002).


Abstract: China’s independently developed next-generation Sunway supercomputer features a more powerful memory system and higher computational density compared to its predecessor,the Sunway TaihuLight.Its primary programming model remains the bulk synchronous parallelism(BSP) model.The sequential task flow(STF) model,based on data flow information,automates the task parallelization of serial programs and achieves asynchronous parallelism through fine-grained synchronization between tasks.Compared to the global synchronization of the BSP model,STF offers higher parallelism and more balanced load distribution,providing users with a new option for efficiently utilizing the Sunway platform.However,on many-core systems,the runtime overhead of the STF model directly impacts the performance of parallel programs.This paper first analyzes two characteristics of the new-generation Sunway processor that affect the efficient implementation of the STF model.Then,leveraging the unique features of the processor architecture,it proposes an agent-based dataflow graph construction mechanism to meet the model’s graph construction requirements and a lock-free centralized task scheduling mechanism to reduce scheduling overhead.Finally,based on these techniques,an efficient task flow parallel system is implemented for the AceMesh model.Experiments show that the implemented task flow parallel system has significant advantages over traditional runtime support,achieving a maximum speedup of 2.37 times in fine-grained task scenarios;the performance of AceMesh exceeds that of the OpenACC model on the Sunway platform,with a maximum speedup of 2.07 times for typical applications.
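To make the lock-free centralized scheduling idea concrete, the following is a minimal C sketch of a single-producer/multi-consumer ready-task queue: one scheduler thread (for example, the core that maintains the task graph) publishes tasks whose dependences are satisfied, and worker threads claim them with an atomic compare-and-swap instead of a lock. This is only an illustration of the general technique, not the runtime structure proposed in the paper or the AceMesh API; all names (ready_queue_t, rq_push, rq_pop, QUEUE_CAP) are hypothetical.

/* Illustrative sketch only -- not the paper's implementation.
 * Lock-free, centralized ready-task queue: a single scheduler thread
 * pushes ready tasks, many worker threads pop them.  Head and tail are
 * monotonically increasing 64-bit counters, so there is no ABA problem. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 1024                        /* ring-buffer capacity */

typedef struct task { void (*run)(void *arg); void *arg; } task_t;

typedef struct {
    _Atomic uint64_t head;                    /* next slot workers will claim      */
    _Atomic uint64_t tail;                    /* next slot the scheduler will fill */
    _Atomic(task_t *) slots[QUEUE_CAP];       /* atomic so a worker that loses the
                                                 race reads a stale pointer harmlessly */
} ready_queue_t;

/* Called only by the single scheduler thread (single producer). */
static bool rq_push(ready_queue_t *q, task_t *t)
{
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAP)
        return false;                         /* queue full */
    atomic_store_explicit(&q->slots[tail % QUEUE_CAP], t, memory_order_relaxed);
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);  /* publish */
    return true;
}

/* Called by any worker thread (multiple consumers); lock-free. */
static task_t *rq_pop(ready_queue_t *q)
{
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                      /* no ready task right now */
        task_t *t = atomic_load_explicit(&q->slots[head % QUEUE_CAP],
                                         memory_order_relaxed);
        /* Claim the task; on failure `head` is refreshed and we retry. */
        if (atomic_compare_exchange_weak_explicit(&q->head, &head, head + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return t;
    }
}

The only contended operation on the pop path is a single compare-and-swap on a shared counter, which illustrates why a lock-free claim can be cheaper than a mutex-protected queue in the fine-grained task scenarios the abstract refers to.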

Key words: Sequential task flow model, Heterogeneous many-core parallelism, Task scheduling, Dataflow parallelism, Bulk synchronous model

CLC Number: 

  • TP311