Computer Science ›› 2024, Vol. 51 ›› Issue (12): 137-146. doi: 10.11896/jsjkx.231100135
FU You1, DU Leiming1, GAO Xiran2, CHEN Li2
Abstract: Compared with its predecessor, Sunway TaihuLight, China's independently developed new-generation Sunway supercomputer offers a more powerful memory system and a higher compute density, yet its mainstream programming model is still the Bulk Synchronous Parallelism (BSP) model. The Sequential Task Flow (STF) model automatically extracts task parallelism from sequential programs based on dataflow information and achieves asynchronous parallelism through fine-grained inter-task synchronization; compared with the global synchronization of BSP, it delivers higher parallelism and better load balance, and thus offers users a new option for exploiting the Sunway platform efficiently. On many-core systems, however, the runtime overhead of the STF model directly affects parallel program performance. This paper first analyzes two characteristics of the new-generation Sunway processor that hinder an efficient STF implementation. It then exploits distinctive features of the processor architecture to propose a proxy-based dataflow graph construction mechanism that satisfies the model's graph-building requirements, and a lock-free centralized task scheduling mechanism that reduces scheduling overhead. Finally, based on these techniques, an efficient task-flow parallel system is implemented for the AceMesh model. Experiments show that this task-flow parallel system clearly outperforms the traditional runtime support, achieving speedups of up to 2.37x in fine-grained task scenarios, and that AceMesh outperforms the OpenACC model on the Sunway platform, accelerating typical applications by up to 2.07x.
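The abstract only summarizes the scheduling idea, and the paper's actual runtime targets the Sunway CPE clusters with interfaces not shown here. The following is a minimal C++ sketch, under stated assumptions, of what a lock-free centralized task pool can look like: a single producer thread (standing in for the graph-construction proxy) publishes ready tasks, and worker threads (standing in for slave cores) claim them with one atomic compare-and-swap instead of taking a lock. All names (TaskPool, push, try_run_one) are illustrative and are not part of AceMesh's real interface.

// Minimal sketch (illustrative only): a centralized lock-free task pool.
// One producer publishes ready tasks; workers claim them without locks.
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct TaskPool {
    static constexpr int kCapacity = 1024;       // overflow checks omitted in this sketch
    std::function<void()> slots[kCapacity];      // task bodies, written by the producer only
    std::atomic<int> published{0};               // count of tasks visible to workers
    std::atomic<int> claimed{0};                 // next slot a worker may take

    // Producer side: store the task body first, then make it visible (release).
    void push(std::function<void()> body) {
        int idx = published.load(std::memory_order_relaxed);
        slots[idx] = std::move(body);
        published.store(idx + 1, std::memory_order_release);
    }

    // Worker side: claim one published task; returns false when none is left.
    bool try_run_one() {
        int idx = claimed.load(std::memory_order_relaxed);
        for (;;) {
            if (idx >= published.load(std::memory_order_acquire)) return false;
            if (claimed.compare_exchange_weak(idx, idx + 1,
                                              std::memory_order_acq_rel)) {
                slots[idx]();                    // run the claimed task
                return true;
            }
            // CAS failure reloads idx with the current 'claimed' value; retry.
        }
    }
};

int main() {
    TaskPool pool;
    std::atomic<int> done{0};
    for (int i = 0; i < 64; ++i)                 // producer publishes 64 dummy tasks
        pool.push([&done] { done.fetch_add(1); });

    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w)                  // 4 threads stand in for slave cores
        workers.emplace_back([&pool] { while (pool.try_run_one()) {} });
    for (auto& t : workers) t.join();

    std::printf("tasks executed: %d\n", done.load());  // expect 64
    return 0;
}

A real STF runtime would additionally track data dependences so that a task is published only after its predecessors complete; the sketch shows only the lock-free claim path that replaces a lock-protected ready queue.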