Computer Science ›› 2024, Vol. 51 ›› Issue (12): 137-146. doi: 10.11896/jsjkx.231100135

• High Performance Computing •

Efficient Task Flow Parallel System for New Generation Sunway Processor

FU You1, DU Leiming1, GAO Xiran2, CHEN Li2   

  1 College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
    2 State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing 100190, China
  • Received: 2023-11-20  Revised: 2024-04-03  Online: 2024-12-15  Published: 2024-12-10
  • About author: FU You, born in 1968, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.26368M). Her main research interests include high performance computing, distributed computing and intelligent computing.
    CHEN Li, born in 1970, Ph.D, associate professor, is a member of CCF (No.05815M). Her main research interests include parallel programming languages and parallelizing compilation techniques.
  • Supported by:
    Natural Science Foundation of Shandong Province (ZR2022MF274, ZR2021LZH004) and National Key Research and Development Program of China (2017YFB0202002).

Abstract: China’s independently developed next-generation Sunway supercomputer features a more powerful memory system and higher computational density than its predecessor, the Sunway TaihuLight. Its primary programming model remains the bulk synchronous parallel (BSP) model. The sequential task flow (STF) model, based on data-flow information, automates the task parallelization of serial programs and achieves asynchronous parallelism through fine-grained synchronization between tasks. Compared with the global synchronization of the BSP model, STF offers higher parallelism and more balanced load distribution, giving users a new option for efficiently exploiting the Sunway platform. On many-core systems, however, the runtime overhead of the STF model directly affects the performance of parallel programs. This paper first analyzes two characteristics of the new Sunway processor that affect an efficient implementation of the STF model. It then exploits the distinctive features of the processor architecture to propose an agent-based dataflow graph construction mechanism to meet the modeling requirements, and a lock-free centralized task scheduling mechanism to reduce scheduling overhead. Finally, on top of these techniques, an efficient task flow parallel system is implemented for the AceMesh model. Experiments show that the implemented task flow parallel system has significant advantages over traditional runtime support, achieving a speedup of up to 2.37x in fine-grained task scenarios; on the Sunway platform, AceMesh outperforms the OpenACC model, with a speedup of up to 2.07x for typical applications.
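The paper’s AceMesh directives are not reproduced on this page, so the following sketch only illustrates the STF idea using the OpenMP 4.0 task-dependence notation of Refs.[1,3]: the programmer keeps a serial loop, annotates each task with the data it reads and writes, and the runtime derives the task graph and orders producer/consumer pairs without a global barrier. The array names and sizes below are illustrative assumptions, not code from the paper.

#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N], c[N];
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i++) {
            /* Producer task: writes a[i]. */
            #pragma omp task depend(out: a[i])
            a[i] = (double)i;

            /* Consumer task: reads a[i], writes b[i]; the runtime orders it
               after the matching producer, with no global synchronization. */
            #pragma omp task depend(in: a[i]) depend(out: b[i])
            b[i] = 2.0 * a[i];

            /* Independent task: touches only c[i], so it can overlap with
               any producer/consumer pair. */
            #pragma omp task depend(out: c[i])
            c[i] = 1.0;
        }
    }   /* implicit barrier of the parallel region: all tasks have finished */
    printf("b[N-1] = %f, c[N-1] = %f\n", b[N - 1], c[N - 1]);
    return 0;
}

The lock-free centralized scheduling mentioned in the abstract is likewise not detailed on this page; a common lock-free way to realize fine-grained inter-task synchronization is the dependence-release pattern sketched below with C11 atomics, where the worker that retires a task’s last unfinished predecessor publishes that successor to the ready pool. The type and function names (task_t, enqueue_ready, on_task_complete) are hypothetical, not the runtime’s actual API.

#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical task node: remaining unfinished predecessors and successor list. */
typedef struct task {
    atomic_int deps_left;
    struct task **succ;
    int nsucc;
    const char *name;
} task_t;

/* Placeholder for handing a task to the centralized ready pool. */
static void enqueue_ready(task_t *t) { printf("%s becomes ready\n", t->name); }

/* On task completion, atomically release each successor's dependence count;
   dropping a counter to zero makes that successor runnable, with no lock
   taken on the shared task graph. */
static void on_task_complete(task_t *t) {
    for (int i = 0; i < t->nsucc; i++) {
        task_t *s = t->succ[i];
        if (atomic_fetch_sub_explicit(&s->deps_left, 1, memory_order_acq_rel) == 1)
            enqueue_ready(s);
    }
}

int main(void) {
    task_t consumer = { .deps_left = 2, .succ = NULL, .nsucc = 0, .name = "consumer" };
    task_t *succ[1] = { &consumer };
    task_t p1 = { .deps_left = 0, .succ = succ, .nsucc = 1, .name = "producer1" };
    task_t p2 = { .deps_left = 0, .succ = succ, .nsucc = 1, .name = "producer2" };
    on_task_complete(&p1);   /* consumer still waits on one predecessor */
    on_task_complete(&p2);   /* counter reaches zero: consumer is enqueued */
    return 0;
}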

Key words: Sequential task flow model, Heterogeneous multi-core parallelism, Task scheduling, Dataflow parallelism, Bulk synchronous model

CLC Number: TP311
[1]DURAN A,PEREZ J M,AYGUADE E,et al.Extending the OpenMP tasking model to allow dependent tasks[C]//OpenMP in a New Era of Parallelism:4th International Workshop,IWOMP 2008,West Lafayette,USA,May 12-14,2008,Proceedings.Berlin,Germany:Springer,2008:111-122.
[2]DURAN A,AYGUADE E,BADIA R M,et al.OmpSs:A proposal for programming heterogeneous multi-core architectures[J].Parallel Processing Letters,2011,21(2):173-193.
[3]OpenMP ARB.OpenMP application program interface version 4.0[R].The OpenMP Forum,2013.
[4]LEE J,SATO M.Implementation and performance evaluation of XcalableMP:A parallel programming language for distributed memory systems[C]//2010 39th International Conference on Parallel Processing Workshops.IEEE,2010:413-420.
[5]CHEN L,TANG S,FU Y,et al.AceMesh:A structured data driven programming language for high performance computing[J].CCF Transactions on High Performance Computing,2020,2:309-322.
[6]CHEN X,GAO Y,SHANG H,et al.Increasing the efficiency of massively parallel sparse matrix-matrix multiplication in first-principles calculation on the new-generation Sunway supercomputer[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(12):4752-4766.
[7]FU Y,WANG T,GUO Q,et al.Parallelization and optimization of Tend_Lin on Sunway TaihuLight system[J].Journal of Shandong University of Science and Technology(Natural Science),2019,38(2):90-99.
[8]GUO J,GAO X R,CHEN L,et al.Parallelizing multigrid application using data-driven programming model[J].Computer Science,2020,47(8):32-40.
[9]YE Y X,FU Y,LIANG J G,et al.Composition optimization method of AceMesh programming model on Sunway TaihuLight Platform[J].Journal of Shandong University of Science and Technology(Natural Science),2021,40(4):76-85.
[10]TANG X,ZHANG C,ZHAI J,et al.A fast lock for explicit message passing architectures[J].IEEE Transactions on Computers,2020,70(10):1555-1568.
[11]ÁLVAREZ D,SALA K,MARONAS M,et al.Advanced synchronization techniques for task-based runtime systems[C]//Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.New York,NY,USA:ACM,2021:334-347.
[12]PILLET V,LABARTA J,CORTES T,et al.Paraver:A tool to visualize and analyze parallel code[C]//Proceedings of WoTUG-18:Transputer and OCCAM Developments.Amsterdam:IOS Press,1995:17-31.
[13]AUGONNET C,THIBAULT S,NAMYST R,et al.StarPU:A unified platform for task scheduling on heterogeneous multicore architectures[C]//Euro-Par 2009 Parallel Processing:15th International Euro-Par Conference,Delft,The Netherlands,August 25-28,2009.Berlin,Germany:Springer,2009:863-874.
[14]CAO C,HERAULT T,BOSILCA G,et al.Design for a soft error resilient dynamic task-based runtime[C]//2015 IEEE International Parallel and Distributed Processing Symposium.IEEE,2015:765-774.
[15]YARKHAN A,KURZAK J,DONGARRA J.QUARK users' guide:QUeueing And Runtime for Kernels:Technical Report:ICL-UT-11-02[R].University of Tennessee Innovative Computing Laboratory,2011.
[16]TILLENIUS M.Scientific computing on multicore architectures[D].Sweden:Uppsala University,2014.
[17]VANDIERENDONCK H,TZENAKIS G,NIKOLOPOULOS D S.Analysis of dependence tracking algorithms for task dataflow execution[J].ACM Transactions on Architecture and Code Optimization(TACO),2013,10(4):1-24.
[18]BOSCH J,ÁLVAREZ C,JIMENEZ-GONZALEZ D,et al.Asynchronous runtime with distributed manager for task-based programming models[J].Parallel Computing,2020,97:102664.
[19]CASTES C,AGULLO E,AUMAGE O,et al.Decentralized in-order execution of a sequential task-based code for shared-memory architectures[C]//2022 IEEE International Parallel and Distributed Processing Symposium Workshops(IPDPSW).IEEE,2022:552-561.
[20]WANG Y,ZHANG Y,SU Y,et al.An adaptive and hierarchical task scheduling scheme for multi-core clusters[J].Parallel Computing,2014,40(10):611-627.
[21]MUDDUKRISHNA A,JONSSON P A,BRORSSON M.Locality-aware task scheduling and data distribution for OpenMP programs on NUMA systems and manycore processors[J].Scientific Programming,2016,2015:5.
[22]OLIVIER S L,PORTERFIELD A K,WHEELER K B,et al.OpenMP task scheduling strategies for multicore NUMA systems[J].The International Journal of High Performance Computing Applications,2012,26(2):110-124.
[23]NOOKALA P,DINDA P,HALE K C,et al.Enabling extremely fine-grained parallelism via scalable concurrent queues on modern many-core architectures[C]//2021 29th International Symposium on Modeling,Analysis,and Simulation of Computer and Telecommunication Systems(MASCOTS).IEEE,2021:1-8.