Computer Science ›› 2020, Vol. 47 ›› Issue (8): 98-104.doi: 10.11896/jsjkx.191000013

;

Previous Articles     Next Articles

Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010

YUAN Xin-hui, LIN Rong-fen, WEI Di, YIN Wan-wang, XU Jin-xiu   

  1. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China
  • Online:2020-08-15 Published:2020-08-10
  • About author:YUAN Xin-hui, born in 1989, master, research associate.His main research interests include parallel algorithm design and optimization, MPI and benchmark optimization.
  • Supported by:
    This work was supported by the National Key Research and Development Project of China(2016YFB0201100, 2017YFB0202702), National Basic Research Program of China(2014CB744100) and National High Technology Research and Development Program of China (2012AA01A306).

Abstract: In recent years, there is growing concern for the processing capabilities of data-intensive task.Breadth-first search(BFS) is a typical data-intensive problem, which is widely used in a variety of graph algorithms.Graph 500 Benchmark, taking BFS algorithm as the core, has become the benchmark for the evaluation of processing capabilities of data-intensive tasks.Sunway TaihuLight supercomputer topped the Top 500 list for four consecutive times from June 2016 to November 2017, the processor of which, named SW26010, is the first Chinese homegrown heterogeneous many-core processor.This paper studies how to use the architecture characteristics of SW26010 to accelerate BFS algorithm.A direction-optimizing hybrid BFS algorithm based on a single core group(CG) is implemented on SW26010, using bytemap to release the data dependencies in inner loops, hiding overhead of calculation and SPM access by using asynchronous DMA, taking advantage of heterogeneous architecture to compute collaboratively and carrying out graph preprocessing.Eventually, with Graph 500 as the benchmark processing a scale 22 graph, a single CG of SW26010 processor achieves a performance of 457.54MTEPS.

Key words: Breadth First Search, Data-intensive, Graph 500, Heterogeneous many-core, Sunway TaihuLight, SW26010

CLC Number: 

  • TP311
[1] Graph 500[OL].http://www.graph500.org.
[2] AGARWAL V, PETRINI F, PASETTO D, et al.Scalable Graph Exploration on Multicore Processors∥International Conference for High Performance Computing, Networking, Storage & Analysis.IEEE, 2010:1-11.
[3] UENO K, SUZUMURA T.Highly scalable graph search for the Graph500 benchmark[C]∥International Symposium on High-performance Parallel & Distributed Computing.2012:149-160.
[4] TITHI J J, MATANI D, MENGHANI G, et al.Avoiding Locks and Atomic Instructions in Shared-Memory Parallel BFS Using Optimistic Parallelization[C]∥IEEE International Symposium on Parallel & Distributed Processing.IEEE, 2013:1628-1637.
[5] CHHUGANI J, SATISH N, KIM C, et al.Fast and EfficientGraph Traversal Algorithm for CPUs:Maximizing Single-Node Efficiency[C]∥IEEE International Parallel & Distributed Processing Symposium.2012:378-389.
[6] BEAMER S, BULUC A, ASANOVIC K, et al.DistributedMemory Breadth-First Search Revisited:Enabling Bottom-Up Search[C]∥IEEE International Parallel & Distributed Proces-sing Symposium Workshops &Phd Forum.2013:1618-1627.
[7] YASUI Y, FUJISAWA K, GOTO K.NUMA-optimized parallelbreadth-first search on multicore single-node system[C]∥IEEE International Conference on Big Data.IEEE, 2013:394-402.
[8] YASUI Y, FUJISAWA K, SATO Y.Fast and Energy-efficientBreadth-First Search on a Single NUMA System[M]∥Supercomputing.Cham:Springer, 2014:365-381.
[9] TAO G, YUTONG L, GUANG S.Using MIC to Accelerate aTypical Data-Intensive Application:The Breadth-first Search[C]∥IEEE International Symposium on Parallel & Distributed Processing.2013:1117-1125.
[10]BUSATO F, BOMBIERI N.BFS-4K:an Efficient Implementa-tion of BFS for Kepler GPU Architectures[J].IEEE Transactions on Parallel & Distributed Systems, 2014(1):1-1.
[11]ZOU D, DOU Y, WANG Q, et al.Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms[C]∥IEEE International Conference on High Performance Computing & Communications & IEEE International Conference on Embedded & Ubiquitous Computing.IEEE, 2013:1064-1069.
[12]CHECCONI F, PETRINI F, WILLCOCK J, et al.Breaking thespeed and scalability Barriers for Graph exploration on distributed-memory machines[J].International Conference for High Performance Computing, Networking, Storage & Analysis, 2012, 7196(5):1-12.
[13]FU H, LIAO J, YANG J, et al.The Sunway TaihuLight supercomputer:system and applications[J].Sciece China Information Sciences, 2016, 59:1-16
[14]DUANE M, MICHAEL G, ANDREW G.Scalable GPU graph traversal[J].Acm Sigplan Notices, 2012, 47(8).
[15]FU Z S, HARISH K D, BRADLEY B, et al.Parallel breadth first search on GPU clusters[C]∥2014 IEEE International Conference on Big Data(Big Data).IEEE, 2014.
[16]PETER Z, ERIC H, JOHN M, et al.Dynamic parallelism for simple and efficient GPU graph algorithms[C]∥Proceedings of the 5th Workshop on Irregular Applications:Architectures and Algorithms.ACM, 2015.
[17]YANG H, HANG L, HUANG H H.High-performance triangle counting on gpus[C]∥2018 IEEE High Performance extreme Computing Conference(HPEC).IEEE, 2018.
[1] CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-ze, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang. Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture [J]. Computer Science, 2022, 49(6): 99-107.
[2] ZHU Yu, PANG Jian-min, XU Jin-long, TAO Xiao-han, WANG Jun. Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor [J]. Computer Science, 2021, 48(6): 10-18.
[3] HE Ya-ru, PANG Jian-min, XU Jin-long, ZHU Yu, TAO Xiao-han. Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform [J]. Computer Science, 2021, 48(6): 34-40.
[4] GUO Jie, GAO Xi-ran, CHEN Li, FU You, LIU Ying. Parallelizing Multigrid Application Using Data-driven Programming Model [J]. Computer Science, 2020, 47(8): 32-40.
[5] JIN Qi, WANG Jun-chang, FU Xiong. Cuckoo Hash Table Based on Smart Placement Strategy [J]. Computer Science, 2020, 47(8): 80-86.
[6] LIU Xiao-nan, JING Li-na, WANG Li-xin, WANG Mei-ling. Large-scale Quantum Fourier Transform Simulation Based on SW26010 [J]. Computer Science, 2020, 47(8): 93-97.
[7] LV Xiao-jing, LIU Zhao, CHU Xue-sen, SHI Shu-peng, MENG Hong-song, HUANG Zhen-chun. Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations [J]. Computer Science, 2020, 47(4): 13-17.
[8] NI Hong, LIU Xin. Many-core Optimization for Sparse Triangular Solver Under Unstructured Grids [J]. Computer Science, 2019, 46(6A): 518-522.
[9] TAO Xiao-han, PANG Jian-min, GAO Wei, WANG Qi, YAO Jin-yang. Performance Optimization of FT Program Based on SW26010 Processor [J]. Computer Science, 2019, 46(4): 321-328.
[10] YAO Qing, ZHENG Kai, LIU Yao, WANG Su, SUN Jun, XU Meng-xuan. Implementation and Optimization of SOM Algorithm on Sunway Many-core Processors [J]. Computer Science, 2018, 45(11A): 591-596.
[11] MENG De-long, WEN Min-hua, WEI Jian-wen and James LIN. Porting and Optimizing OpenFOAM on Sunway TaihuLight System [J]. Computer Science, 2017, 44(10): 64-70.
[12] WANG Li-ming, JIANG Qin and ZHANG Zhuo. Incremental Algorithm for Constructing Fuzzy Concept Lattices Based on Attributes Decrement [J]. Computer Science, 2016, 43(8): 216-222.
[13] CHEN Wei and SMIELIAUSKAS Wally. Opportunities,Challenges and Methods of Electric Data Auditing in Big Data Environments [J]. Computer Science, 2016, 43(1): 8-13.
[14] XU Jin-chen,GUO Shao-zhong,HUANG Yong-zhong and WANG Lei. Access Optimization Technique for Mathematical Library of Slave Processors on Heterogeneous Many-core Architectures [J]. Computer Science, 2014, 41(6): 12-17.
[15] . Data Selection Strategy for Data-intensive Applications in Cloud [J]. Computer Science, 2012, 39(6): 30-34.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!