Computer Science ›› 2024, Vol. 51 ›› Issue (12): 137-146. doi: 10.11896/jsjkx.231100135

• High Performance Computing •

Efficient Task Flow Parallel System for New Generation Sunway Processor

FU You1, DU Leiming1, GAO Xiran2, CHEN Li2   

  1 College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
    2 State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing 100190, China
  • Received: 2023-11-20  Revised: 2024-04-03  Online: 2024-12-15  Published: 2024-12-10
  • About author: FU You, born in 1968, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.26368M). Her main research interests include high performance computing, distributed computing and intelligent computing.
    CHEN Li, born in 1970, Ph.D, associate professor, is a member of CCF (No.05815M). Her main research interests include parallel programming languages and parallelizing compilation techniques.
  • Supported by:
    Natural Science Foundation of Shandong Province (ZR2022MF274, ZR2021LZH004) and National Key Research and Development Program of China (2017YFB0202002).

Abstract: China’s independently developed next-generation Sunway supercomputer features a more powerful memory system and higher computational density than its predecessor, the Sunway TaihuLight. Its primary programming model remains the bulk synchronous parallel (BSP) model. The sequential task flow (STF) model, based on data-flow information, automates the task parallelization of serial programs and achieves asynchronous parallelism through fine-grained synchronization between tasks. Compared with the global synchronization of the BSP model, STF offers higher parallelism and more balanced load distribution, giving users a new option for efficiently exploiting the Sunway platform. On many-core systems, however, the runtime overhead of the STF model directly affects the performance of parallel programs. This paper first analyzes two characteristics of the new Sunway processor that affect an efficient implementation of the STF model. It then exploits the distinctive features of the processor architecture to propose an agent-based dataflow graph construction mechanism to meet the modeling requirements, and a lock-free centralized task scheduling mechanism to reduce scheduling overhead. Finally, on top of these techniques, an efficient task flow parallel system is implemented for the AceMesh model. Experiments show that the implemented task flow parallel system has significant advantages over traditional runtime support, achieving a speedup of up to 2.37x in fine-grained task scenarios; on the Sunway platform, AceMesh outperforms the OpenACC model, with a speedup of up to 2.07x for typical applications.
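The paper’s AceMesh directives are not reproduced on this page, so the following sketch only illustrates the STF idea using the OpenMP 4.0 task-dependence notation of Refs.[1,3]: the programmer keeps a serial loop, annotates each task with the data it reads and writes, and the runtime derives the task graph and orders producer/consumer pairs without a global barrier. The array names and sizes below are illustrative assumptions, not code from the paper.

#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N], c[N];
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i++) {
            /* Producer task: writes a[i]. */
            #pragma omp task depend(out: a[i])
            a[i] = (double)i;

            /* Consumer task: reads a[i], writes b[i]; the runtime orders it
               after the matching producer, with no global synchronization. */
            #pragma omp task depend(in: a[i]) depend(out: b[i])
            b[i] = 2.0 * a[i];

            /* Independent task: touches only c[i], so it can overlap with
               any producer/consumer pair. */
            #pragma omp task depend(out: c[i])
            c[i] = 1.0;
        }
    }   /* implicit barrier of the parallel region: all tasks have finished */
    printf("b[N-1] = %f, c[N-1] = %f\n", b[N - 1], c[N - 1]);
    return 0;
}

The lock-free centralized scheduling mentioned in the abstract is likewise not detailed on this page; a common lock-free way to realize fine-grained inter-task synchronization is the dependence-release pattern sketched below with C11 atomics, where the worker that retires a task’s last unfinished predecessor publishes that successor to the ready pool. The type and function names (task_t, enqueue_ready, on_task_complete) are hypothetical, not the runtime’s actual API.

#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical task node: remaining unfinished predecessors and successor list. */
typedef struct task {
    atomic_int deps_left;
    struct task **succ;
    int nsucc;
    const char *name;
} task_t;

/* Placeholder for handing a task to the centralized ready pool. */
static void enqueue_ready(task_t *t) { printf("%s becomes ready\n", t->name); }

/* On task completion, atomically release each successor's dependence count;
   dropping a counter to zero makes that successor runnable, with no lock
   taken on the shared task graph. */
static void on_task_complete(task_t *t) {
    for (int i = 0; i < t->nsucc; i++) {
        task_t *s = t->succ[i];
        if (atomic_fetch_sub_explicit(&s->deps_left, 1, memory_order_acq_rel) == 1)
            enqueue_ready(s);
    }
}

int main(void) {
    task_t consumer = { .deps_left = 2, .succ = NULL, .nsucc = 0, .name = "consumer" };
    task_t *succ[1] = { &consumer };
    task_t p1 = { .deps_left = 0, .succ = succ, .nsucc = 1, .name = "producer1" };
    task_t p2 = { .deps_left = 0, .succ = succ, .nsucc = 1, .name = "producer2" };
    on_task_complete(&p1);   /* consumer still waits on one predecessor */
    on_task_complete(&p2);   /* counter reaches zero: consumer is enqueued */
    return 0;
}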

Key words: Sequential task flow model, Heterogeneous multi-core parallelism, Task scheduling, Dataflow parallelism, Bulk synchronous model

CLC Number: TP311
[1]DURAN A,PEREZ J M,AYGUADE E,et al.Extending the OpenMP tasking model to allow dependent tasks[C]//OpenMP in a New Era of Parallelism:4th International Workshop,IWOMP 2008,West Lafayette,USA,May 12-14,2008,Proceedings.Berlin,Germany:Springer,2008:111-122.
[2]DURAN A,AYGUADE E,BADIA R M,et al.OmpSs:A proposal for programming heterogeneous multi-core architectures[J].Parallel Processing Letters,2011,21(2):173-193.
[3]OpenMP ARB.OpenMP application program interface version 4.0[R].The OpenMP Forum,2013.
[4]LEE J,SATO M.Implementation and performance evaluation of XcalableMP:A parallel programming language for distributed memory systems[C]//2010 39th International Conference on Parallel Processing Workshops.IEEE,2010:413-420.
[5]CHEN L,TANG S,FU Y,et al.AceMesh:A structured data driven programming language for high performance computing[J].CCF Transactions on High Performance Computing,2020,2:309-322.
[6]CHEN X,GAO Y,SHANG H,et al.Increasing the efficiency of massively parallel sparse matrix-matrix multiplication in first-principles calculation on the new-generation Sunway supercomputer[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(12):4752-4766.
[7]FU Y,WANG T,GUO Q,et al.Parallelization and optimization of Tend_Lin on Sunway TaihuLight system[J].Journal of Shandong University of Science and Technology(Natural Science),2019,38(2):90-99.
[8]GUO J,GAO X R,CHEN L,et al.Parallelizing multigrid application using data-driven programming model[J].Computer Science,2020,47(8):32-40.
[9]YE Y X,FU Y,LIANG J G,et al.Composition optimization method of AceMesh programming model on Sunway TaihuLight Platform[J].Journal of Shandong University of Science and Technology(Natural Science),2021,40(4):76-85.
[10]TANG X,ZHANG C,ZHAI J,et al.A fast lock for explicit message passing architectures[J].IEEE Transactions on Computers,2020,70(10):1555-1568.
[11]ÁLVAREZ D,SALA K,MARONAS M,et al.Advanced synchronization techniques for task-based runtime systems[C]//Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.New York,NY,USA:ACM,2021:334-347.
[12]PILLET V,LABARTA J,CORTES T,et al.Paraver:A tool to visualize and analyze parallel code[C]//Proceedings of WoTUG-18:Transputer and OCCAM Developments.Amsterdam:IOS Press,1995:17-31.
[13]AUGONNET C,THIBAULT S,NAMYST R,et al.StarPU:A unified platform for task scheduling on heterogeneous multicore architectures[C]//Euro-Par 2009 Parallel Processing:15th International Euro-Par Conference,Delft,The Netherlands,August 25-28,2009.Berlin,Germany:Springer,2009:863-874.
[14]CAO C,HERAULT T,BOSILCA G,et al.Design for a soft error resilient dynamic task-based runtime[C]//2015 IEEE International Parallel and Distributed Processing Symposium.IEEE,2015:765-774.
[15]YARKHAN A,KURZAK J,DONGARRA J.QUARK users' guide:QUeueing And Runtime for Kernels:Technical Report:ICL-UT-11-02[R].University of Tennessee Innovative Computing Laboratory,2011.
[16]TILLENIUS M.Scientific computing on multicore architectures[D].Sweden:Uppsala University,2014.
[17]VANDIERENDONCK H,TZENAKIS G,NIKOLOPOULOS D S.Analysis of dependence tracking algorithms for task dataflow execution[J].ACM Transactions on Architecture and Code Optimization(TACO),2013,10(4):1-24.
[18]BOSCH J,ÁLVAREZ C,JIMENEZ-GONZALEZ D,et al.Asynchronous runtime with distributed manager for task-based programming models[J].Parallel Computing,2020,97:102664.
[19]CASTES C,AGULLO E,AUMAGE O,et al.Decentralized in-order execution of a sequential task-based code for shared-memory architectures[C]//2022 IEEE International Parallel and Distributed Processing Symposium Workshops(IPDPSW).IEEE,2022:552-561.
[20]WANG Y,ZHANG Y,SU Y,et al.An adaptive and hierarchical task scheduling scheme for multi-core clusters[J].Parallel Computing,2014,40(10):611-627.
[21]MUDDUKRISHNA A,JONSSON P A,BRORSSON M.Locality-aware task scheduling and data distribution for OpenMP programs on NUMA systems and manycore processors[J].Scientific Programming,2016,2015:5.
[22]OLIVIER S L,PORTERFIELD A K,WHEELER K B,et al.OpenMP task scheduling strategies for multicore NUMA systems[J].The International Journal of High Performance Computing Applications,2012,26(2):110-124.
[23]NOOKALA P,DINDA P,HALE K C,et al.Enabling extremely fine-grained parallelism via scalable concurrent queues on modern many-core architectures[C]//2021 29th International Symposium on Modeling,Analysis,and Simulation of Computer and Telecommunication Systems(MASCOTS).IEEE,2021:1-8.