计算机科学 ›› 2020, Vol. 47 ›› Issue (8): 98-104.doi: 10.11896/jsjkx.191000013

所属专题: 高性能计算

• 高性能计算 • 上一篇    下一篇

面向国产异构众核处理器SW26010的BFS优化方法

袁欣辉, 林蓉芬, 魏迪, 尹万旺, 徐金秀   

  1. 江南计算技术研究所 江苏 无锡 214083
  • 出版日期:2020-08-15 发布日期:2020-08-10
  • 通讯作者: 袁欣辉(yuanxh07@foxmail.com)
  • 基金资助:
    国家重点研发计划资助项目(2016YFB0201100, 2017YFB0202702);国家“973”计划资助项目(2014CB744100);国家“863”计划资助项目(2012AA01A306)

Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010

YUAN Xin-hui, LIN Rong-fen, WEI Di, YIN Wan-wang, XU Jin-xiu   

  1. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China
  • Online:2020-08-15 Published:2020-08-10
  • About author:YUAN Xin-hui, born in 1989, master, research associate.His main research interests include parallel algorithm design and optimization, MPI and benchmark optimization.
  • Supported by:
    This work was supported by the National Key Research and Development Project of China(2016YFB0201100, 2017YFB0202702), National Basic Research Program of China(2014CB744100) and National High Technology Research and Development Program of China (2012AA01A306).

摘要: 近年来, 人们越来越关注计算机对数据密集型课题的处理能力。宽度优先搜索(Breadth First Search, BFS)是一种典型的数据密集型课题, 被广泛应用于多种图算法。Graph 500 Benchmark以BFS搜索为核心算法, 已经成为评价计算机处理大数据能力的基准。神威太湖之光超级计算机从2016年6月至2017年11月连续4次荣登Top 500榜单榜首, 其处理器SW26010是首款由我国自主研制的异构众核处理器。文中研究了如何利用SW26010的体系结构特点加速BFS算法的问题, 在SW26010上实现了基于单个核组的方向优化的融合BFS算法, 使用字节图(bytemap)释放内层循环依赖性, 利用异步DMA隐藏计算与便签存储器的访问开销, 利用异构架构协同运算并对图做预处理。最终, 以Graph 500作为基准测试程序处理scale为22的图, SW26010处理器单核组BFS的性能达到457.54MTEPS。

关键词: Graph 500, SW26010, 宽度优先搜索, 神威太湖之光, 数据密集, 异构众核

Abstract: In recent years, there is growing concern for the processing capabilities of data-intensive task.Breadth-first search(BFS) is a typical data-intensive problem, which is widely used in a variety of graph algorithms.Graph 500 Benchmark, taking BFS algorithm as the core, has become the benchmark for the evaluation of processing capabilities of data-intensive tasks.Sunway TaihuLight supercomputer topped the Top 500 list for four consecutive times from June 2016 to November 2017, the processor of which, named SW26010, is the first Chinese homegrown heterogeneous many-core processor.This paper studies how to use the architecture characteristics of SW26010 to accelerate BFS algorithm.A direction-optimizing hybrid BFS algorithm based on a single core group(CG) is implemented on SW26010, using bytemap to release the data dependencies in inner loops, hiding overhead of calculation and SPM access by using asynchronous DMA, taking advantage of heterogeneous architecture to compute collaboratively and carrying out graph preprocessing.Eventually, with Graph 500 as the benchmark processing a scale 22 graph, a single CG of SW26010 processor achieves a performance of 457.54MTEPS.

Key words: Breadth First Search, Data-intensive, Graph 500, Heterogeneous many-core, Sunway TaihuLight, SW26010

中图分类号: 

  • TP311
[1] Graph 500[OL].http://www.graph500.org.
[2] AGARWAL V, PETRINI F, PASETTO D, et al.Scalable Graph Exploration on Multicore Processors∥International Conference for High Performance Computing, Networking, Storage & Analysis.IEEE, 2010:1-11.
[3] UENO K, SUZUMURA T.Highly scalable graph search for the Graph500 benchmark[C]∥International Symposium on High-performance Parallel & Distributed Computing.2012:149-160.
[4] TITHI J J, MATANI D, MENGHANI G, et al.Avoiding Locks and Atomic Instructions in Shared-Memory Parallel BFS Using Optimistic Parallelization[C]∥IEEE International Symposium on Parallel & Distributed Processing.IEEE, 2013:1628-1637.
[5] CHHUGANI J, SATISH N, KIM C, et al.Fast and EfficientGraph Traversal Algorithm for CPUs:Maximizing Single-Node Efficiency[C]∥IEEE International Parallel & Distributed Processing Symposium.2012:378-389.
[6] BEAMER S, BULUC A, ASANOVIC K, et al.DistributedMemory Breadth-First Search Revisited:Enabling Bottom-Up Search[C]∥IEEE International Parallel & Distributed Proces-sing Symposium Workshops &Phd Forum.2013:1618-1627.
[7] YASUI Y, FUJISAWA K, GOTO K.NUMA-optimized parallelbreadth-first search on multicore single-node system[C]∥IEEE International Conference on Big Data.IEEE, 2013:394-402.
[8] YASUI Y, FUJISAWA K, SATO Y.Fast and Energy-efficientBreadth-First Search on a Single NUMA System[M]∥Supercomputing.Cham:Springer, 2014:365-381.
[9] TAO G, YUTONG L, GUANG S.Using MIC to Accelerate aTypical Data-Intensive Application:The Breadth-first Search[C]∥IEEE International Symposium on Parallel & Distributed Processing.2013:1117-1125.
[10]BUSATO F, BOMBIERI N.BFS-4K:an Efficient Implementa-tion of BFS for Kepler GPU Architectures[J].IEEE Transactions on Parallel & Distributed Systems, 2014(1):1-1.
[11]ZOU D, DOU Y, WANG Q, et al.Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms[C]∥IEEE International Conference on High Performance Computing & Communications & IEEE International Conference on Embedded & Ubiquitous Computing.IEEE, 2013:1064-1069.
[12]CHECCONI F, PETRINI F, WILLCOCK J, et al.Breaking thespeed and scalability Barriers for Graph exploration on distributed-memory machines[J].International Conference for High Performance Computing, Networking, Storage & Analysis, 2012, 7196(5):1-12.
[13]FU H, LIAO J, YANG J, et al.The Sunway TaihuLight supercomputer:system and applications[J].Sciece China Information Sciences, 2016, 59:1-16
[14]DUANE M, MICHAEL G, ANDREW G.Scalable GPU graph traversal[J].Acm Sigplan Notices, 2012, 47(8).
[15]FU Z S, HARISH K D, BRADLEY B, et al.Parallel breadth first search on GPU clusters[C]∥2014 IEEE International Conference on Big Data(Big Data).IEEE, 2014.
[16]PETER Z, ERIC H, JOHN M, et al.Dynamic parallelism for simple and efficient GPU graph algorithms[C]∥Proceedings of the 5th Workshop on Irregular Applications:Architectures and Algorithms.ACM, 2015.
[17]YANG H, HANG L, HUANG H H.High-performance triangle counting on gpus[C]∥2018 IEEE High Performance extreme Computing Conference(HPEC).IEEE, 2018.
[1] 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香.
面向国产异构众核架构的CFD非结构网格计算并行优化方法
Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture
计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[2] 朱雨, 庞建民, 徐金龙, 陶小涵, 王军.
面向SW26010处理器的三维Stencil自适应分块参数算法
Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor
计算机科学, 2021, 48(6): 10-18. https://doi.org/10.11896/jsjkx.200700059
[3] 何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵.
基于神威平台的Floyd并行算法的实现和优化
Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform
计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051
[4] 郭杰, 高希然, 陈莉, 傅游, 刘颖.
用数据驱动的编程模型并行多重网格应用
Parallelizing Multigrid Application Using Data-driven Programming Model
计算机科学, 2020, 47(8): 32-40. https://doi.org/10.11896/jsjkx.200500093
[5] 吕小敬, 刘钊, 褚学森, 石树鹏, 孟虹松, 黄震春.
面向超大规模并行模拟的LBM计算流体力学软件
Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations
计算机科学, 2020, 47(4): 13-17. https://doi.org/10.11896/jsjkx.191000010
[6] 倪鸿, 刘鑫.
非结构网格下稀疏下三角方程求解器众核优化技术研究
Many-core Optimization for Sparse Triangular Solver Under Unstructured Grids
计算机科学, 2019, 46(6A): 518-522.
[7] 陶小涵, 庞建民, 高伟, 王琦, 姚金阳.
基于SW26010处理器的FT程序的性能优化
Performance Optimization of FT Program Based on SW26010 Processor
计算机科学, 2019, 46(4): 321-328. https://doi.org/10.11896/j.issn.1002-137X.2019.04.050
[8] 姚庆, 郑凯, 刘垚, 王肃, 孙军, 徐梦轩.
SOM算法在申威众核上的实现和优化
Implementation and Optimization of SOM Algorithm on Sunway Many-core Processors
计算机科学, 2018, 45(11A): 591-596.
[9] 洪菁.
大数据时代的思维特点研究
Research on Characteristics of Thinking in Era of Big Data
计算机科学, 2016, 43(Z11): 472-473. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.106
[10] 陈伟,SMIELIAUSKAS Wally.
大数据环境下的电子数据审计:机遇、挑战与方法
Opportunities,Challenges and Methods of Electric Data Auditing in Big Data Environments
计算机科学, 2016, 43(1): 8-13. https://doi.org/10.11896/j.issn.1002-137X.2016.01.002
[11] 许瑾晨,郭绍忠,黄永忠,王磊.
面向异构众核从核的数学函数库访存优化方法
Access Optimization Technique for Mathematical Library of Slave Processors on Heterogeneous Many-core Architectures
计算机科学, 2014, 41(6): 12-17. https://doi.org/10.11896/j.issn.1002-137X.2014.06.003
[12] 张建勋,古志民.
帮助线程预取技术研究综述
Survey of Helper Thread Prefetching
计算机科学, 2013, 40(7): 19-23.
[13] 杜薇,崔国华,刘伟,石飞燕,位凯志.
云环境下面向数据密集型应用的数据选择策略研究
Data Selection Strategy for Data-intensive Applications in Cloud
计算机科学, 2012, 39(6): 30-34.
[14] 薛卫,梁敬东,林金星.
基于肤色信息与宽度优先搜索的AAM人脸特征定位算法
AAM Facial Feature Localization Algorithm Based on Skin Model and Breadth-First Search
计算机科学, 2011, 38(8): 275-277.
[15] 石宣化 金海.
有服务质量保证的数据密集型网格应用管理研究

计算机科学, 2007, 34(6): 131-135.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!