面向GPU计算平台的归约算法的性能优化研究

doi:10.11896/j.issn.1002-137X.2019.02.047

计算机科学 ›› 2019, Vol. 46 ›› Issue (2): 306-314.doi: 10.11896/j.issn.1002-137X.2019.02.047

面向GPU计算平台的归约算法的性能优化研究

张逸然¹, 陈龙², 安向哲², 颜深根³

北京信息科技大学北京100049¹
中国石油集团东方地球物理勘探有限责任公司河北涿州072751²
深圳市商汤科技有限公司广东深圳518000³

收稿日期:2018-09-12 出版日期:2019-02-25 发布日期:2019-02-25
作者简介:张逸然(1999－),男,主要研究方向为高性能计算、众核编程与优化方法;陈龙(1973－),男,硕士,主要研究方向为高性能计算、人工智能在勘探行业的应用;安向哲(1974－),硕士,主要研究方向为云计算;颜深根(1985－),男,博士,主要研究方向为GPU编程与优化、自适应性能调优、大规模分布式深度学习等。

Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

ZHANG Yi-ran¹, CHEN Long², AN Xiang-zhe², YAN Shen-gen³

Beijing Information Science & Technology University,Beijing 100049,China¹
BGP INC.,China National Petroleum Corporation,Zhuozhou,Hebei 072751,China²
SenseTime,Shenzhen,Guangdong 518000,China³

Received:2018-09-12 Online:2019-02-25 Published:2019-02-25

摘要/Abstract

摘要： 归约算法在科学计算和图像处理等领域有着十分广泛的应用,是并行计算的基本算法之一,因此对归约算法进行加速具有重要意义。为了充分挖掘异构计算平台下GPU的计算能力以对归约算法进行加速,文中提出基于线程内归约、work-group内归约和work-group间归约3个层面的归约优化方法,并打破以往相关工作将优化重心集中在work-group内归约上的传统思维,通过论证指出线程内归约才是归约算法的瓶颈所在。实验结果表明,在不同的数据规模下,所提归约算法与经过精心优化的OpenCV库的CPU版本相比,在AMD W8000和NVIDIA Tesla K20M平台上分别达到了3.91~15.93和2.97~20.24的加速比;相比于OpenCV库的CUDA版本与OpenCL版本,在NVIDIA Tesla K20M平台上分别达到了2.25~5.97和1.25~1.75的加速比;相比于OpenCL版本,在AMD W8000平台上达到了1.24~5.15的加速比。文中工作不仅实现了归约算法在GPU计算平台上的高性能,而且实现了在不同GPU计算平台间的性能可移植。

关键词: GPU, OpenCL, 归约算法, 线程内归约

Abstract: Reduction algorithm has wide application in scientific computing and image processing,and it is one of the basic algorithms of parallel computing.Hence,it is significant to accelerate reduction algorithm.In order to fully exploit the capability of GPU for general-purpose computing under heterogeneous processing platform,this paper proposed a multi-level reduction optimization algorithm including inner-thread reduction,inner-work-group reduction and inter-work-group reduction.Different from the traditional way of reduction algorithm optimization of putting more emphasis on inner-work-group reduction,this paper proved that inner-thread reduction is the true bottleneck of reduction algorithm.The experimental results demonstrate that the performance of proposed reduction algorithm has reached 3.91~15.93 and 2.97~20.24 times speedup respectively in AMD W8000 and NVIDIA Tesla K20M under different sizes of data set,compared with carefully optimized CPU version of OpenCV library.In NVIDIA Tesla K20M,compared with CUDA version and OpenCL version of OpenCV library,the algorithm has reached 2.25~5.97 and 1.25~1.75 times speedup respectively.And compared with OpenCL version of OpenCV library in AMD W8000,the algorithm has reached 1.24~5.15 times speedup.This work not only realizes high performance of reduction algorithm on GPU platform,but also reaches the portability of performance between different GPU computing platforms.

Key words: GPU, Inner threads reduction, OpenCL, Reduction algorithm

中图分类号:

TP391

张逸然, 陈龙, 安向哲, 颜深根. 面向GPU计算平台的归约算法的性能优化研究[J]. 计算机科学, 2019, 46(2): 306-314. https://doi.org/10.11896/j.issn.1002-137X.2019.02.047

ZHANG Yi-ran, CHEN Long, AN Xiang-zhe, YAN Shen-gen. Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform[J]. Computer Science, 2019, 46(2): 306-314. https://doi.org/10.11896/j.issn.1002-137X.2019.02.047

参考文献

[1]HARRIS M.Optimizing parallel reduction in CUDA[OL].http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[2]CATANZARO B.Opencl optimization case study:Simple reductions[OL].https://developer.amd.com/article/opencl-optimization-case-study-simple-reductions.
[3]YAN S G,YUNQUAN Z,GUOPING L.Reduction algorithm optimization based on the OpenCL[J].Journal of Software,2011,22(S2):163-171.
[4]OpenCV Library Open Source Project[OL].http://opencv. org.
[5]BRADSKI G,KAEHLER A.Learning OpenCV:Computer vision with the OpenCV library[M].O’Reilly Media,Inc.,2008.
[6]RAWAT P S,RASTELLO F,SUKUMARAN-RAJAM A,et al.Register optimizations for stencils on GPUs[C]∥Procee-dings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).New York,NY,USA,2018.
[7]CECKA C.Low communication FMM-accelerated FFT on GPUs [C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis (SC’17).ACM,New York,NY,USA,2017.
[8]ZHANG X X,TAN G M,XUE S B,et al.Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning[C]∥Proceedings of the 22nd ACM SIGPLAN Sympo-sium on Principles and Practice of Parallel Programming (PPoPP’17).New York,NY,USA,2017.
[9]LI B,SUN J,ANNAVARAM M,et al.Elastic-Cache:GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management[C]∥Proceedings of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).Orlando,FL,2017.
[10]KIRK D B,WEN-MEI W H.Programming massively parallel processors:a hands-on approach[M].Newnes,2012.
[11]WU E H,LIU Y Q.The General Computing based on GPU[J].Proceeding of Journal of Computer-Aided Design and Computer Graphics,2004,16(5):601-612.
[12]OWENS J D,HOUSTON M,LUEBKE D.et al.GPU Computing[J].Proceedings of the IEEE,2008,96(5):879-899.
[13]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A survey of general-purpose computation on graphics hardware[J].Computer Graphics Forum:Journal of the European Association for Computer Graphics,2007,26(1):80-113.
[14]Compute unified device architecture programming guide[OL].http://www.nvidia.com/object/cuda_home.html.
[15]Nvidia cuda C.programming guide[OL].https://docs.nvidia.com/cuda/cuda-c-programming-guide.
[16]NVIDIA cuda C.Best practices guide[OL].https://docs. nvidia.com/cuda/cuda-c-best-practices-guide.
[17]The opencl 1.2 speciffication[M].Khronos OpenCL Working Group,2012.
[18]STONE J E,GOHARA D,SHI G.OpenCL:A parallel programming standard for heterogeneous computing systems[J].Computing in Science & Engineering,2010,12(1-3):66-73.
[19]GASTER B,HOWES L,KAELI D R,et al.Heterogeneous Computing with OpenCL:Revised OpenCL 1[M].Newnes,2012.
[20]ZOU H,WANG H Q,HUANG Y.GPU Accelerated Rainbow Tables Analysis of MD5 Hash Password[J].Journal of Chongqing University of Technology(Natural Science),2013,27(7):61-66.(in Chinese)
邹航,王华秋,黄勇.基于GPU加速的彩虹表分析MD5哈希密码[J].重庆理工大学学报(自然科学),2013,27(7):61-66.

相关文章 15

[1]	宗迪迪, 谢益武. 基于法线迭代的模型中轴生成方法 Model Medial Axis Generation Method Based on Normal Iteration 计算机科学, 2022, 49(6A): 764-770. https://doi.org/10.11896/jsjkx.210400050
[2]	汪晋, 刘江. 基于GPU的并行DILU预处理技术 GPU-based Parallel DILU Preconditioning Technique 计算机科学, 2022, 49(6): 108-118. https://doi.org/10.11896/jsjkx.210300259
[3]	李繁, 严星, 张晓宇. 基于GPU的特征脸算法优化研究 Optimization of GPU-based Eigenface Algorithm 计算机科学, 2021, 48(4): 197-204. https://doi.org/10.11896/jsjkx.200600033
[4]	齐延荣, 周夏冰, 李斌, 周清雷. 基于FPGA的CNN图像识别加速与优化 FPGA-based CNN Image Recognition Acceleration and Optimization 计算机科学, 2021, 48(4): 205-212. https://doi.org/10.11896/jsjkx.200600089
[5]	胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立. 基于GPU加速的并行WMD算法 Parallel WMD Algorithm Based on GPU Acceleration 计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213
[6]	刘世芳, 赵永华, 于天禹, 黄荣锋. 广义稠密对称特征问题标准化算法在GPU集群上的有效实现 Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster 计算机科学, 2020, 47(4): 6-12. https://doi.org/10.11896/jsjkx.191000009
[7]	左宪禹, 张哲, 苏岳瀚, 刘扬, 葛强, 田军锋. 基于GPU多流并发并行模型的NDVI提取算法 Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model 计算机科学, 2020, 47(4): 25-29. https://doi.org/10.11896/jsjkx.190500029
[8]	康林瑶, 唐兵, 夏艳敏, 张黎. 基于GPU加速和非负矩阵分解的并行协同过滤推荐算法 GPU-accelerated Non-negative Matrix Factorization-based Parallel Collaborative Filtering Recommendation Algorithm 计算机科学, 2019, 46(8): 106-110. https://doi.org/10.11896/j.issn.1002-137X.2019.08.017
[9]	苏庆华, 付景超, 谷焓, 张姗姗, 李奕飞, 江方舟, 白翰林, 赵地. 前列腺癌辅助诊断GPU并行算法设计 Parallel Algorithm Design for Assisted Diagnosis of Prostate Cancer 计算机科学, 2019, 46(11A): 524-527.
[10]	卢献华, 王洪俊. 基于大数据计算框架的分布式新闻聚类系统设计 Design of Distributed News Clustering System Based on Big Data Computing Framework 计算机科学, 2019, 46(11A): 220-223.
[11]	邓定胜. 一种基于可编程GPU的实时烟雾模拟算法研究 Study on Real-time Smoke Simulation Algorithm Based on Programmable GPU 计算机科学, 2019, 46(11A): 604-608.
[12]	刘玉成, 理查德·丁, 张颖超. 一种BPNNs识别算法的医学检测泛实时性问题研究 Research on Pan-real-time Problem of Medical Detection Based on BPNNs Recognition Algorithm 计算机科学, 2018, 45(6): 301-307. https://doi.org/10.11896/j.issn.1002-137X.2018.06.053
[13]	廖星,袁景凌,陈旻骋. 一种自适应权重的并行PSO快速装箱算法 Parallel PSO Container Packing Algorithm with Adaptive Weight 计算机科学, 2018, 45(3): 231-234. https://doi.org/10.11896/j.issn.1002-137X.2018.03.036
[14]	武昱, 闫光辉, 王雅斐, 马青青, 刘宇轩. 结合GPU技术的并行CP张量分解算法 Parallel CP Tensor Decomposition Algorithm Combining with GPU Technology 计算机科学, 2018, 45(11): 298-303. https://doi.org/10.11896／j.issn.1002-137X.2018.11.048
[15]	张帅, 徐顺, 刘倩, 金钟. 基于GPU的分子动力学模拟Cell Verlet算法实现及其并行性能分析 Cell Verlet Algorithm of Molecular Dynamics Simulation Based on GPU and Its Parallel Performance Analysis 计算机科学, 2018, 45(10): 291-294. https://doi.org/10.11896／j.issn.1002-137X.2018.10.054

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

面向GPU计算平台的归约算法的性能优化研究

Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

Metrics

本文评价

推荐阅读 0