计算机科学 ›› 2019, Vol. 46 ›› Issue (2): 306-314.doi: 10.11896/j.issn.1002-137X.2019.02.047

• 交叉与前沿 • 上一篇    下一篇

面向GPU计算平台的归约算法的性能优化研究

张逸然1, 陈龙2, 安向哲2, 颜深根3   

  1. 北京信息科技大学 北京1000491
    中国石油集团东方地球物理勘探有限责任公司 河北 涿州0727512
    深圳市商汤科技有限公司 广东 深圳5180003
  • 收稿日期:2018-09-12 出版日期:2019-02-25 发布日期:2019-02-25
  • 作者简介:张逸然(1999-),男,主要研究方向为高性能计算、众核编程与优化方法;陈 龙(1973-),男,硕士,主要研究方向为高性能计算、人工智能在勘探行业的应用;安向哲(1974-),硕士,主要研究方向为云计算;颜深根(1985-),男,博士,主要研究方向为GPU编程与优化、自适应性能调优、大规模分布式深度学习等。

Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

ZHANG Yi-ran1, CHEN Long2, AN Xiang-zhe2, YAN Shen-gen3   

  1. Beijing Information Science & Technology University,Beijing 100049,China1
    BGP INC.,China National Petroleum Corporation,Zhuozhou,Hebei 072751,China2
    SenseTime,Shenzhen,Guangdong 518000,China3
  • Received:2018-09-12 Online:2019-02-25 Published:2019-02-25

摘要: 归约算法在科学计算和图像处理等领域有着十分广泛的应用,是并行计算的基本算法之一,因此对归约算法进行加速具有重要意义。为了充分挖掘异构计算平台下GPU的计算能力以对归约算法进行加速,文中提出基于线程内归约、work-group内归约和work-group间归约3个层面的归约优化方法,并打破以往相关工作将优化重心集中在work-group内归约上的传统思维,通过论证指出线程内归约才是归约算法的瓶颈所在。实验结果表明,在不同的数据规模下,所提归约算法与经过精心优化的OpenCV库的CPU版本相比,在AMD W8000和NVIDIA Tesla K20M平台上分别达到了3.91~15.93和2.97~20.24的加速比;相比于OpenCV库的CUDA版本与OpenCL版本,在NVIDIA Tesla K20M平台上分别达到了2.25~5.97和1.25~1.75的加速比;相比于OpenCL版本,在AMD W8000平台上达到了1.24~5.15的加速比。文中工作不仅实现了归约算法在GPU计算平台上的高性能,而且实现了在不同GPU计算平台间的性能可移植。

关键词: 归约算法, GPU, 线程内归约, OpenCL

Abstract: Reduction algorithm has wide application in scientific computing and image processing,and it is one of the basic algorithms of parallel computing.Hence,it is significant to accelerate reduction algorithm.In order to fully exploit the capability of GPU for general-purpose computing under heterogeneous processing platform,this paper proposed a multi-level reduction optimization algorithm including inner-thread reduction,inner-work-group reduction and inter-work-group reduction.Different from the traditional way of reduction algorithm optimization of putting more emphasis on inner-work-group reduction,this paper proved that inner-thread reduction is the true bottleneck of reduction algorithm.The experimental results demonstrate that the performance of proposed reduction algorithm has reached 3.91~15.93 and 2.97~20.24 times speedup respectively in AMD W8000 and NVIDIA Tesla K20M under different sizes of data set,compared with carefully optimized CPU version of OpenCV library.In NVIDIA Tesla K20M,compared with CUDA version and OpenCL version of OpenCV library,the algorithm has reached 2.25~5.97 and 1.25~1.75 times speedup respectively.And compared with OpenCL version of OpenCV library in AMD W8000,the algorithm has reached 1.24~5.15 times speedup.This work not only realizes high performance of reduction algorithm on GPU platform,but also reaches the portability of performance between different GPU computing platforms.

Key words: Reduction algorithm, GPU, Inner threads reduction, OpenCL

中图分类号: 

  • TP391
[1]HARRIS M.Optimizing parallel reduction in CUDA[OL].http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[2]CATANZARO B.Opencl optimization case study:Simple reductions[OL].https://developer.amd.com/article/opencl-optimization-case-study-simple-reductions.
[3]YAN S G,YUNQUAN Z,GUOPING L.Reduction algorithm optimization based on the OpenCL[J].Journal of Software,2011,22(S2):163-171.
[4]OpenCV Library Open Source Project[OL].http://opencv. org.
[5]BRADSKI G,KAEHLER A.Learning OpenCV:Computer vision with the OpenCV library[M].O’Reilly Media,Inc.,2008.
[6]RAWAT P S,RASTELLO F,SUKUMARAN-RAJAM A,et al.Register optimizations for stencils on GPUs[C]∥Procee-dings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).New York,NY,USA,2018.
[7]CECKA C.Low communication FMM-accelerated FFT on GPUs [C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis (SC’17).ACM,New York,NY,USA,2017.
[8]ZHANG X X,TAN G M,XUE S B,et al.Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning[C]∥Proceedings of the 22nd ACM SIGPLAN Sympo-sium on Principles and Practice of Parallel Programming (PPoPP’17).New York,NY,USA,2017.
[9]LI B,SUN J,ANNAVARAM M,et al.Elastic-Cache:GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management[C]∥Proceedings of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).Orlando,FL,2017.
[10]KIRK D B,WEN-MEI W H.Programming massively parallel processors:a hands-on approach[M].Newnes,2012.
[11]WU E H,LIU Y Q.The General Computing based on GPU[J].Proceeding of Journal of Computer-Aided Design and Computer Graphics,2004,16(5):601-612.
[12]OWENS J D,HOUSTON M,LUEBKE D.et al.GPU Computing[J].Proceedings of the IEEE,2008,96(5):879-899.
[13]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A survey of general-purpose computation on graphics hardware[J].Computer Graphics Forum:Journal of the European Association for Computer Graphics,2007,26(1):80-113.
[14]Compute unified device architecture programming guide[OL].http://www.nvidia.com/object/cuda_home.html.
[15]Nvidia cuda C.programming guide[OL].https://docs.nvidia.com/cuda/cuda-c-programming-guide.
[16]NVIDIA cuda C.Best practices guide[OL].https://docs. nvidia.com/cuda/cuda-c-best-practices-guide.
[17]The opencl 1.2 speciffication[M].Khronos OpenCL Working Group,2012.
[18]STONE J E,GOHARA D,SHI G.OpenCL:A parallel programming standard for heterogeneous computing systems[J].Computing in Science & Engineering,2010,12(1-3):66-73.
[19]GASTER B,HOWES L,KAELI D R,et al.Heterogeneous Computing with OpenCL:Revised OpenCL 1[M].Newnes,2012.
[20]ZOU H,WANG H Q,HUANG Y.GPU Accelerated Rainbow Tables Analysis of MD5 Hash Password[J].Journal of Chongqing University of Technology(Natural Science),2013,27(7):61-66.(in Chinese)
邹航,王华秋,黄勇.基于GPU加速的彩虹表分析MD5哈希密码[J].重庆理工大学学报(自然科学),2013,27(7):61-66.
[1] 康林瑶, 唐兵, 夏艳敏, 张黎. 基于GPU加速和非负矩阵分解的并行协同过滤推荐算法[J]. 计算机科学, 2019, 46(8): 106-110.
[2] 刘玉成, 理查德·丁, 张颖超. 一种BPNNs识别算法的医学检测泛实时性问题研究[J]. 计算机科学, 2018, 45(6): 301-307.
[3] 廖星,袁景凌,陈旻骋. 一种自适应权重的并行PSO快速装箱算法[J]. 计算机科学, 2018, 45(3): 231-234, 273.
[4] 武昱, 闫光辉, 王雅斐, 马青青, 刘宇轩. 结合GPU技术的并行CP张量分解算法[J]. 计算机科学, 2018, 45(11): 298-303,317.
[5] 张帅, 徐顺, 刘倩, 金钟. 基于GPU的分子动力学模拟Cell Verlet算法实现及其并行性能分析[J]. 计算机科学, 2018, 45(10): 291-294, 299.
[6] 卢嘉铭,朱哲. 基于GPU加速的实时4K全景视频拼接[J]. 计算机科学, 2017, 44(8): 18-21, 26.
[7] 尹孟嘉,许先斌,何水兵,胡婧,叶从欢,张涛. GPU稀疏矩阵向量乘的性能模型构造[J]. 计算机科学, 2017, 44(4): 182-187, 206.
[8] 袁斌. 基于B样条的Level-Set GPU演化算法[J]. 计算机科学, 2017, 44(3): 59-62.
[9] 时学凯,王文珂,黄辉,李思昆,傅艺绮. 基于压缩域的脑成像大数据体可视化方法[J]. 计算机科学, 2017, 44(3): 27-31.
[10] 刘金硕,江庄毅,徐亚渤,邓娟,章岚昕. PMVS算法的CPU多线程和GPU两级粒度并行策略[J]. 计算机科学, 2017, 44(2): 296-301.
[11] 司文杰,杨飞飞. 基于大规模训练神经网络的微小故障在线检测[J]. 计算机科学, 2017, 44(2): 239-243, 266.
[12] 陈静,方建滨,唐滔,杨灿群. 多核/众核平台上推荐算法的实现与性能评估[J]. 计算机科学, 2017, 44(10): 71-74.
[13] 都志辉,林璋熙,顾彦祺,Eric O.LEBIGOT,郭翔宇. 引力波cWB处理流水线的GPU加速[J]. 计算机科学, 2017, 44(10): 26-32.
[14] 甘威,张素文,雷震,李怡凡. 移动智能终端的SIFT特征检测并行算法[J]. 计算机科学, 2016, 43(Z6): 165-167.
[15] 韦博文,李涛,李广宇,汪致恒,何沐,师悦龄,刘路遥,张瑞. 使用OpenCL技术的影像快速畸变纠正方法在异构平台上的应用分析[J]. 计算机科学, 2016, 43(Z11): 167-169, 196.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 编辑部. 新网站开通,欢迎大家订阅![J]. 计算机科学, 2018, 1(1): 1 .
[2] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75, 88 .
[3] 夏庆勋,庄毅. 一种基于局部性原理的远程验证机制[J]. 计算机科学, 2018, 45(4): 148 -151, 162 .
[4] 厉柏伸,李领治,孙涌,朱艳琴. 基于伪梯度提升决策树的内网防御算法[J]. 计算机科学, 2018, 45(4): 157 -162 .
[5] 王欢,张云峰,张艳. 一种基于CFDs规则的修复序列快速判定方法[J]. 计算机科学, 2018, 45(3): 311 -316 .
[6] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[7] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[8] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[9] 刘琴. 计算机取证过程中基于约束的数据质量问题研究[J]. 计算机科学, 2018, 45(4): 169 -172 .
[10] 钟菲,杨斌. 基于主成分分析网络的车牌检测方法[J]. 计算机科学, 2018, 45(3): 268 -273 .