计算机科学 ›› 2019, Vol. 46 ›› Issue (2): 306-314.doi: 10.11896/j.issn.1002-137X.2019.02.047

• 交叉与前沿 • 上一篇    下一篇

面向GPU计算平台的归约算法的性能优化研究

张逸然1, 陈龙2, 安向哲2, 颜深根3   

  1. 北京信息科技大学 北京1000491
    中国石油集团东方地球物理勘探有限责任公司 河北 涿州0727512
    深圳市商汤科技有限公司 广东 深圳5180003
  • 收稿日期:2018-09-12 出版日期:2019-02-25 发布日期:2019-02-25
  • 作者简介:张逸然(1999-),男,主要研究方向为高性能计算、众核编程与优化方法;陈 龙(1973-),男,硕士,主要研究方向为高性能计算、人工智能在勘探行业的应用;安向哲(1974-),硕士,主要研究方向为云计算;颜深根(1985-),男,博士,主要研究方向为GPU编程与优化、自适应性能调优、大规模分布式深度学习等。

Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

ZHANG Yi-ran1, CHEN Long2, AN Xiang-zhe2, YAN Shen-gen3   

  1. Beijing Information Science & Technology University,Beijing 100049,China1
    BGP INC.,China National Petroleum Corporation,Zhuozhou,Hebei 072751,China2
    SenseTime,Shenzhen,Guangdong 518000,China3
  • Received:2018-09-12 Online:2019-02-25 Published:2019-02-25

摘要: 归约算法在科学计算和图像处理等领域有着十分广泛的应用,是并行计算的基本算法之一,因此对归约算法进行加速具有重要意义。为了充分挖掘异构计算平台下GPU的计算能力以对归约算法进行加速,文中提出基于线程内归约、work-group内归约和work-group间归约3个层面的归约优化方法,并打破以往相关工作将优化重心集中在work-group内归约上的传统思维,通过论证指出线程内归约才是归约算法的瓶颈所在。实验结果表明,在不同的数据规模下,所提归约算法与经过精心优化的OpenCV库的CPU版本相比,在AMD W8000和NVIDIA Tesla K20M平台上分别达到了3.91~15.93和2.97~20.24的加速比;相比于OpenCV库的CUDA版本与OpenCL版本,在NVIDIA Tesla K20M平台上分别达到了2.25~5.97和1.25~1.75的加速比;相比于OpenCL版本,在AMD W8000平台上达到了1.24~5.15的加速比。文中工作不仅实现了归约算法在GPU计算平台上的高性能,而且实现了在不同GPU计算平台间的性能可移植。

关键词: GPU, OpenCL, 归约算法, 线程内归约

Abstract: Reduction algorithm has wide application in scientific computing and image processing,and it is one of the basic algorithms of parallel computing.Hence,it is significant to accelerate reduction algorithm.In order to fully exploit the capability of GPU for general-purpose computing under heterogeneous processing platform,this paper proposed a multi-level reduction optimization algorithm including inner-thread reduction,inner-work-group reduction and inter-work-group reduction.Different from the traditional way of reduction algorithm optimization of putting more emphasis on inner-work-group reduction,this paper proved that inner-thread reduction is the true bottleneck of reduction algorithm.The experimental results demonstrate that the performance of proposed reduction algorithm has reached 3.91~15.93 and 2.97~20.24 times speedup respectively in AMD W8000 and NVIDIA Tesla K20M under different sizes of data set,compared with carefully optimized CPU version of OpenCV library.In NVIDIA Tesla K20M,compared with CUDA version and OpenCL version of OpenCV library,the algorithm has reached 2.25~5.97 and 1.25~1.75 times speedup respectively.And compared with OpenCL version of OpenCV library in AMD W8000,the algorithm has reached 1.24~5.15 times speedup.This work not only realizes high performance of reduction algorithm on GPU platform,but also reaches the portability of performance between different GPU computing platforms.

Key words: GPU, Inner threads reduction, OpenCL, Reduction algorithm

中图分类号: 

  • TP391
[1]HARRIS M.Optimizing parallel reduction in CUDA[OL].http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[2]CATANZARO B.Opencl optimization case study:Simple reductions[OL].https://developer.amd.com/article/opencl-optimization-case-study-simple-reductions.
[3]YAN S G,YUNQUAN Z,GUOPING L.Reduction algorithm optimization based on the OpenCL[J].Journal of Software,2011,22(S2):163-171.
[4]OpenCV Library Open Source Project[OL].http://opencv. org.
[5]BRADSKI G,KAEHLER A.Learning OpenCV:Computer vision with the OpenCV library[M].O’Reilly Media,Inc.,2008.
[6]RAWAT P S,RASTELLO F,SUKUMARAN-RAJAM A,et al.Register optimizations for stencils on GPUs[C]∥Procee-dings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).New York,NY,USA,2018.
[7]CECKA C.Low communication FMM-accelerated FFT on GPUs [C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis (SC’17).ACM,New York,NY,USA,2017.
[8]ZHANG X X,TAN G M,XUE S B,et al.Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning[C]∥Proceedings of the 22nd ACM SIGPLAN Sympo-sium on Principles and Practice of Parallel Programming (PPoPP’17).New York,NY,USA,2017.
[9]LI B,SUN J,ANNAVARAM M,et al.Elastic-Cache:GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management[C]∥Proceedings of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).Orlando,FL,2017.
[10]KIRK D B,WEN-MEI W H.Programming massively parallel processors:a hands-on approach[M].Newnes,2012.
[11]WU E H,LIU Y Q.The General Computing based on GPU[J].Proceeding of Journal of Computer-Aided Design and Computer Graphics,2004,16(5):601-612.
[12]OWENS J D,HOUSTON M,LUEBKE D.et al.GPU Computing[J].Proceedings of the IEEE,2008,96(5):879-899.
[13]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A survey of general-purpose computation on graphics hardware[J].Computer Graphics Forum:Journal of the European Association for Computer Graphics,2007,26(1):80-113.
[14]Compute unified device architecture programming guide[OL].http://www.nvidia.com/object/cuda_home.html.
[15]Nvidia cuda C.programming guide[OL].https://docs.nvidia.com/cuda/cuda-c-programming-guide.
[16]NVIDIA cuda C.Best practices guide[OL].https://docs. nvidia.com/cuda/cuda-c-best-practices-guide.
[17]The opencl 1.2 speciffication[M].Khronos OpenCL Working Group,2012.
[18]STONE J E,GOHARA D,SHI G.OpenCL:A parallel programming standard for heterogeneous computing systems[J].Computing in Science & Engineering,2010,12(1-3):66-73.
[19]GASTER B,HOWES L,KAELI D R,et al.Heterogeneous Computing with OpenCL:Revised OpenCL 1[M].Newnes,2012.
[20]ZOU H,WANG H Q,HUANG Y.GPU Accelerated Rainbow Tables Analysis of MD5 Hash Password[J].Journal of Chongqing University of Technology(Natural Science),2013,27(7):61-66.(in Chinese)
邹航,王华秋,黄勇.基于GPU加速的彩虹表分析MD5哈希密码[J].重庆理工大学学报(自然科学),2013,27(7):61-66.
[1] 宗迪迪, 谢益武.
基于法线迭代的模型中轴生成方法
Model Medial Axis Generation Method Based on Normal Iteration
计算机科学, 2022, 49(6A): 764-770. https://doi.org/10.11896/jsjkx.210400050
[2] 汪晋, 刘江.
基于GPU的并行DILU预处理技术
GPU-based Parallel DILU Preconditioning Technique
计算机科学, 2022, 49(6): 108-118. https://doi.org/10.11896/jsjkx.210300259
[3] 李繁, 严星, 张晓宇.
基于GPU的特征脸算法优化研究
Optimization of GPU-based Eigenface Algorithm
计算机科学, 2021, 48(4): 197-204. https://doi.org/10.11896/jsjkx.200600033
[4] 齐延荣, 周夏冰, 李斌, 周清雷.
基于FPGA的CNN图像识别加速与优化
FPGA-based CNN Image Recognition Acceleration and Optimization
计算机科学, 2021, 48(4): 205-212. https://doi.org/10.11896/jsjkx.200600089
[5] 胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立.
基于GPU加速的并行WMD算法
Parallel WMD Algorithm Based on GPU Acceleration
计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213
[6] 刘世芳, 赵永华, 于天禹, 黄荣锋.
广义稠密对称特征问题标准化算法在GPU集群上的有效实现
Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster
计算机科学, 2020, 47(4): 6-12. https://doi.org/10.11896/jsjkx.191000009
[7] 左宪禹, 张哲, 苏岳瀚, 刘扬, 葛强, 田军锋.
基于GPU多流并发并行模型的NDVI提取算法
Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model
计算机科学, 2020, 47(4): 25-29. https://doi.org/10.11896/jsjkx.190500029
[8] 康林瑶, 唐兵, 夏艳敏, 张黎.
基于GPU加速和非负矩阵分解的并行协同过滤推荐算法
GPU-accelerated Non-negative Matrix Factorization-based Parallel Collaborative Filtering Recommendation Algorithm
计算机科学, 2019, 46(8): 106-110. https://doi.org/10.11896/j.issn.1002-137X.2019.08.017
[9] 苏庆华, 付景超, 谷焓, 张姗姗, 李奕飞, 江方舟, 白翰林, 赵地.
前列腺癌辅助诊断GPU并行算法设计
Parallel Algorithm Design for Assisted Diagnosis of Prostate Cancer
计算机科学, 2019, 46(11A): 524-527.
[10] 卢献华, 王洪俊.
基于大数据计算框架的分布式新闻聚类系统设计
Design of Distributed News Clustering System Based on Big Data Computing Framework
计算机科学, 2019, 46(11A): 220-223.
[11] 邓定胜.
一种基于可编程GPU的实时烟雾模拟算法研究
Study on Real-time Smoke Simulation Algorithm Based on Programmable GPU
计算机科学, 2019, 46(11A): 604-608.
[12] 刘玉成, 理查德·丁, 张颖超.
一种BPNNs识别算法的医学检测泛实时性问题研究
Research on Pan-real-time Problem of Medical Detection Based on BPNNs Recognition Algorithm
计算机科学, 2018, 45(6): 301-307. https://doi.org/10.11896/j.issn.1002-137X.2018.06.053
[13] 廖星,袁景凌,陈旻骋.
一种自适应权重的并行PSO快速装箱算法
Parallel PSO Container Packing Algorithm with Adaptive Weight
计算机科学, 2018, 45(3): 231-234. https://doi.org/10.11896/j.issn.1002-137X.2018.03.036
[14] 武昱, 闫光辉, 王雅斐, 马青青, 刘宇轩.
结合GPU技术的并行CP张量分解算法
Parallel CP Tensor Decomposition Algorithm Combining with GPU Technology
计算机科学, 2018, 45(11): 298-303. https://doi.org/10.11896/j.issn.1002-137X.2018.11.048
[15] 张帅, 徐顺, 刘倩, 金钟.
基于GPU的分子动力学模拟Cell Verlet算法实现及其并行性能分析
Cell Verlet Algorithm of Molecular Dynamics Simulation Based on GPU and Its Parallel Performance Analysis
计算机科学, 2018, 45(10): 291-294. https://doi.org/10.11896/j.issn.1002-137X.2018.10.054
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!