计算机科学 ›› 2019, Vol. 46 ›› Issue (2): 306-314.doi: 10.11896/j.issn.1002-137X.2019.02.047
张逸然1, 陈龙2, 安向哲2, 颜深根3
ZHANG Yi-ran1, CHEN Long2, AN Xiang-zhe2, YAN Shen-gen3
摘要: 归约算法在科学计算和图像处理等领域有着十分广泛的应用,是并行计算的基本算法之一,因此对归约算法进行加速具有重要意义。为了充分挖掘异构计算平台下GPU的计算能力以对归约算法进行加速,文中提出基于线程内归约、work-group内归约和work-group间归约3个层面的归约优化方法,并打破以往相关工作将优化重心集中在work-group内归约上的传统思维,通过论证指出线程内归约才是归约算法的瓶颈所在。实验结果表明,在不同的数据规模下,所提归约算法与经过精心优化的OpenCV库的CPU版本相比,在AMD W8000和NVIDIA Tesla K20M平台上分别达到了3.91~15.93和2.97~20.24的加速比;相比于OpenCV库的CUDA版本与OpenCL版本,在NVIDIA Tesla K20M平台上分别达到了2.25~5.97和1.25~1.75的加速比;相比于OpenCL版本,在AMD W8000平台上达到了1.24~5.15的加速比。文中工作不仅实现了归约算法在GPU计算平台上的高性能,而且实现了在不同GPU计算平台间的性能可移植。
中图分类号:
[1]HARRIS M.Optimizing parallel reduction in CUDA[OL].http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf. [2]CATANZARO B.Opencl optimization case study:Simple reductions[OL].https://developer.amd.com/article/opencl-optimization-case-study-simple-reductions. [3]YAN S G,YUNQUAN Z,GUOPING L.Reduction algorithm optimization based on the OpenCL[J].Journal of Software,2011,22(S2):163-171. [4]OpenCV Library Open Source Project[OL].http://opencv. org. [5]BRADSKI G,KAEHLER A.Learning OpenCV:Computer vision with the OpenCV library[M].O’Reilly Media,Inc.,2008. [6]RAWAT P S,RASTELLO F,SUKUMARAN-RAJAM A,et al.Register optimizations for stencils on GPUs[C]∥Procee-dings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).New York,NY,USA,2018. [7]CECKA C.Low communication FMM-accelerated FFT on GPUs [C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis (SC’17).ACM,New York,NY,USA,2017. [8]ZHANG X X,TAN G M,XUE S B,et al.Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning[C]∥Proceedings of the 22nd ACM SIGPLAN Sympo-sium on Principles and Practice of Parallel Programming (PPoPP’17).New York,NY,USA,2017. [9]LI B,SUN J,ANNAVARAM M,et al.Elastic-Cache:GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management[C]∥Proceedings of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).Orlando,FL,2017. [10]KIRK D B,WEN-MEI W H.Programming massively parallel processors:a hands-on approach[M].Newnes,2012. [11]WU E H,LIU Y Q.The General Computing based on GPU[J].Proceeding of Journal of Computer-Aided Design and Computer Graphics,2004,16(5):601-612. [12]OWENS J D,HOUSTON M,LUEBKE D.et al.GPU Computing[J].Proceedings of the IEEE,2008,96(5):879-899. [13]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A survey of general-purpose computation on graphics hardware[J].Computer Graphics Forum:Journal of the European Association for Computer Graphics,2007,26(1):80-113. [14]Compute unified device architecture programming guide[OL].http://www.nvidia.com/object/cuda_home.html. [15]Nvidia cuda C.programming guide[OL].https://docs.nvidia.com/cuda/cuda-c-programming-guide. [16]NVIDIA cuda C.Best practices guide[OL].https://docs. nvidia.com/cuda/cuda-c-best-practices-guide. [17]The opencl 1.2 speciffication[M].Khronos OpenCL Working Group,2012. [18]STONE J E,GOHARA D,SHI G.OpenCL:A parallel programming standard for heterogeneous computing systems[J].Computing in Science & Engineering,2010,12(1-3):66-73. [19]GASTER B,HOWES L,KAELI D R,et al.Heterogeneous Computing with OpenCL:Revised OpenCL 1[M].Newnes,2012. [20]ZOU H,WANG H Q,HUANG Y.GPU Accelerated Rainbow Tables Analysis of MD5 Hash Password[J].Journal of Chongqing University of Technology(Natural Science),2013,27(7):61-66.(in Chinese) 邹航,王华秋,黄勇.基于GPU加速的彩虹表分析MD5哈希密码[J].重庆理工大学学报(自然科学),2013,27(7):61-66. |
[1] | 宗迪迪, 谢益武. 基于法线迭代的模型中轴生成方法 Model Medial Axis Generation Method Based on Normal Iteration 计算机科学, 2022, 49(6A): 764-770. https://doi.org/10.11896/jsjkx.210400050 |
[2] | 汪晋, 刘江. 基于GPU的并行DILU预处理技术 GPU-based Parallel DILU Preconditioning Technique 计算机科学, 2022, 49(6): 108-118. https://doi.org/10.11896/jsjkx.210300259 |
[3] | 李繁, 严星, 张晓宇. 基于GPU的特征脸算法优化研究 Optimization of GPU-based Eigenface Algorithm 计算机科学, 2021, 48(4): 197-204. https://doi.org/10.11896/jsjkx.200600033 |
[4] | 齐延荣, 周夏冰, 李斌, 周清雷. 基于FPGA的CNN图像识别加速与优化 FPGA-based CNN Image Recognition Acceleration and Optimization 计算机科学, 2021, 48(4): 205-212. https://doi.org/10.11896/jsjkx.200600089 |
[5] | 胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立. 基于GPU加速的并行WMD算法 Parallel WMD Algorithm Based on GPU Acceleration 计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213 |
[6] | 刘世芳, 赵永华, 于天禹, 黄荣锋. 广义稠密对称特征问题标准化算法在GPU集群上的有效实现 Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster 计算机科学, 2020, 47(4): 6-12. https://doi.org/10.11896/jsjkx.191000009 |
[7] | 左宪禹, 张哲, 苏岳瀚, 刘扬, 葛强, 田军锋. 基于GPU多流并发并行模型的NDVI提取算法 Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model 计算机科学, 2020, 47(4): 25-29. https://doi.org/10.11896/jsjkx.190500029 |
[8] | 康林瑶, 唐兵, 夏艳敏, 张黎. 基于GPU加速和非负矩阵分解的并行协同过滤推荐算法 GPU-accelerated Non-negative Matrix Factorization-based Parallel Collaborative Filtering Recommendation Algorithm 计算机科学, 2019, 46(8): 106-110. https://doi.org/10.11896/j.issn.1002-137X.2019.08.017 |
[9] | 苏庆华, 付景超, 谷焓, 张姗姗, 李奕飞, 江方舟, 白翰林, 赵地. 前列腺癌辅助诊断GPU并行算法设计 Parallel Algorithm Design for Assisted Diagnosis of Prostate Cancer 计算机科学, 2019, 46(11A): 524-527. |
[10] | 卢献华, 王洪俊. 基于大数据计算框架的分布式新闻聚类系统设计 Design of Distributed News Clustering System Based on Big Data Computing Framework 计算机科学, 2019, 46(11A): 220-223. |
[11] | 邓定胜. 一种基于可编程GPU的实时烟雾模拟算法研究 Study on Real-time Smoke Simulation Algorithm Based on Programmable GPU 计算机科学, 2019, 46(11A): 604-608. |
[12] | 刘玉成, 理查德·丁, 张颖超. 一种BPNNs识别算法的医学检测泛实时性问题研究 Research on Pan-real-time Problem of Medical Detection Based on BPNNs Recognition Algorithm 计算机科学, 2018, 45(6): 301-307. https://doi.org/10.11896/j.issn.1002-137X.2018.06.053 |
[13] | 廖星,袁景凌,陈旻骋. 一种自适应权重的并行PSO快速装箱算法 Parallel PSO Container Packing Algorithm with Adaptive Weight 计算机科学, 2018, 45(3): 231-234. https://doi.org/10.11896/j.issn.1002-137X.2018.03.036 |
[14] | 武昱, 闫光辉, 王雅斐, 马青青, 刘宇轩. 结合GPU技术的并行CP张量分解算法 Parallel CP Tensor Decomposition Algorithm Combining with GPU Technology 计算机科学, 2018, 45(11): 298-303. https://doi.org/10.11896/j.issn.1002-137X.2018.11.048 |
[15] | 张帅, 徐顺, 刘倩, 金钟. 基于GPU的分子动力学模拟Cell Verlet算法实现及其并行性能分析 Cell Verlet Algorithm of Molecular Dynamics Simulation Based on GPU and Its Parallel Performance Analysis 计算机科学, 2018, 45(10): 291-294. https://doi.org/10.11896/j.issn.1002-137X.2018.10.054 |
|