Computer Science ›› 2022, Vol. 49 ›› Issue (10): 52-58. doi: 10.11896/jsjkx.210800091

• High Performance Computing

Distributed Lock with Inter-core Passing for SW26010 Processor

LI Ming-liang, PANG Jian-min, YUE Feng

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, PLA Information Engineering University, Zhengzhou 450000, China
  • Received: 2021-08-11 Revised: 2022-01-07 Online: 2022-10-15 Published: 2022-10-13
  • Corresponding author: PANG Jian-min (jianmin_pang@126.com)
  • About author: LI Ming-liang (lmliang_only@163.com), born in 1991, Ph.D. His main research interests include high-performance computing and binary translation.
    PANG Jian-min, born in 1964, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include high-performance computing and information security.
  • Supported by:
    National Natural Science Foundation of China (61472447, 61802433, 61802435).

Abstract: In parallel programs, mutual exclusion locks are commonly used to avoid conflicts when accessing shared resources. The SW26010 processor, a heterogeneous many-core processor deployed in the Sunway TaihuLight supercomputer, provides no hardware lock mechanism among its co-processing cores. Its developers implemented a software mutual exclusion lock based on atomic operations, but under intense lock contention this software lock incurs heavy lock-operation overhead and degrades the performance of parallel programs. To address this problem, a distributed lock mechanism with inter-core passing, HDT-LOCK, is proposed. First, a hybrid distributed lock based on the scratchpad memory of the co-processing cores and on main memory is proposed and implemented to avoid memory access congestion. Second, a lock passing mechanism based on register communication and single-instruction multiple-data (SIMD) instructions is designed to further improve the throughput of HDT-LOCK. Experimental results show that, compared with the original lock mechanism, HDT-LOCK avoids memory access congestion and scales better. In addition, the lock passing mechanism improves the throughput of HDT-LOCK by up to 5.6 times.
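The general idea of lock passing can be illustrated with a small sketch. The code below is not HDT-LOCK and does not model the SW26010 hardware: it uses ordinary Python threads in place of co-processing cores, and all names (`PassingLock`, `max_passes`, `run_demo`) are hypothetical. It only shows, under those assumptions, how a lock holder can hand ownership directly to a waiting neighbour in its own group instead of releasing the lock back to a contended global location, which is similar in spirit to cohort-style two-level locks.

```python
import threading


class PassingLock:
    """Illustrative two-level lock (hypothetical; not the paper's design).

    One global lock guards the shared resource; each group of threads also
    has a local lock. On release, the holder passes ownership to a waiting
    thread of the same group when one exists, so the global lock is not
    touched again. max_passes bounds consecutive hand-offs for fairness.
    """

    def __init__(self, n_groups, max_passes=8):
        self._global = threading.Lock()
        self._local = [threading.Lock() for _ in range(n_groups)]
        self._meta = threading.Lock()          # protects the waiter counters
        self._waiters = [0] * n_groups
        self._owns_global = [False] * n_groups  # touched only by local holder
        self._passes = [0] * n_groups           # touched only by local holder
        self.max_passes = max_passes
        self.local_handoffs = 0                 # updated while holding the lock

    def acquire(self, group):
        with self._meta:
            self._waiters[group] += 1
        self._local[group].acquire()
        with self._meta:
            self._waiters[group] -= 1
        if not self._owns_global[group]:
            # The lock was not passed to us, so compete for the global lock.
            self._global.acquire()
            self._owns_global[group] = True
            self._passes[group] = 0

    def release(self, group):
        with self._meta:
            has_waiter = self._waiters[group] > 0
        if has_waiter and self._passes[group] < self.max_passes:
            # Hand-off: the group keeps the global lock for the next waiter.
            self._passes[group] += 1
            self.local_handoffs += 1
            self._local[group].release()
        else:
            # No local waiter (or pass budget used up): release globally.
            self._owns_global[group] = False
            self._global.release()
            self._local[group].release()


def run_demo(n_groups=2, threads_per_group=4, iters=200):
    """Increment a shared counter under the lock; return (count, hand-offs)."""
    lock = PassingLock(n_groups)
    state = {"counter": 0}

    def worker(group):
        for _ in range(iters):
            lock.acquire(group)
            value = state["counter"]       # read-modify-write that is only
            state["counter"] = value + 1   # safe because the lock is held
            lock.release(group)

    threads = [threading.Thread(target=worker, args=(g,))
               for g in range(n_groups) for _ in range(threads_per_group)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], lock.local_handoffs
```

Calling `run_demo()` returns the final counter value together with the number of local hand-offs. The counter should equal `n_groups * threads_per_group * iters`, while the hand-off count depends on thread scheduling and varies between runs. The `max_passes` bound is a design choice of this sketch that keeps one group from holding the global lock indefinitely.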

Key words: SW26010 processor, Hybrid distributed lock, Inter-core passing, Single-instruction multiple-data instruction, Register communication

CLC Number: 

  • TP319