面向GPU计算平台的归约算法的性能优化研究

doi:10.11896/j.issn.1002-137X.2019.02.047

Abstract

Abstract: Reduction algorithm has wide application in scientific computing and image processing,and it is one of the basic algorithms of parallel computing.Hence,it is significant to accelerate reduction algorithm.In order to fully exploit the capability of GPU for general-purpose computing under heterogeneous processing platform,this paper proposed a multi-level reduction optimization algorithm including inner-thread reduction,inner-work-group reduction and inter-work-group reduction.Different from the traditional way of reduction algorithm optimization of putting more emphasis on inner-work-group reduction,this paper proved that inner-thread reduction is the true bottleneck of reduction algorithm.The experimental results demonstrate that the performance of proposed reduction algorithm has reached 3.91~15.93 and 2.97~20.24 times speedup respectively in AMD W8000 and NVIDIA Tesla K20M under different sizes of data set,compared with carefully optimized CPU version of OpenCV library.In NVIDIA Tesla K20M,compared with CUDA version and OpenCL version of OpenCV library,the algorithm has reached 2.25~5.97 and 1.25~1.75 times speedup respectively.And compared with OpenCL version of OpenCV library in AMD W8000,the algorithm has reached 1.24~5.15 times speedup.This work not only realizes high performance of reduction algorithm on GPU platform,but also reaches the portability of performance between different GPU computing platforms.

Key words: GPU, Inner threads reduction, OpenCL, Reduction algorithm

CLC Number:

TP391

ZHANG Yi-ran, CHEN Long, AN Xiang-zhe, YAN Shen-gen. Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform[J].Computer Science, 2019, 46(2): 306-314.

References

[1]HARRIS M.Optimizing parallel reduction in CUDA[OL].http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[2]CATANZARO B.Opencl optimization case study:Simple reductions[OL].https://developer.amd.com/article/opencl-optimization-case-study-simple-reductions.
[3]YAN S G,YUNQUAN Z,GUOPING L.Reduction algorithm optimization based on the OpenCL[J].Journal of Software,2011,22(S2):163-171.
[4]OpenCV Library Open Source Project[OL].http://opencv. org.
[5]BRADSKI G,KAEHLER A.Learning OpenCV:Computer vision with the OpenCV library[M].O’Reilly Media,Inc.,2008.
[6]RAWAT P S,RASTELLO F,SUKUMARAN-RAJAM A,et al.Register optimizations for stencils on GPUs[C]∥Procee-dings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).New York,NY,USA,2018.
[7]CECKA C.Low communication FMM-accelerated FFT on GPUs [C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis (SC’17).ACM,New York,NY,USA,2017.
[8]ZHANG X X,TAN G M,XUE S B,et al.Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning[C]∥Proceedings of the 22nd ACM SIGPLAN Sympo-sium on Principles and Practice of Parallel Programming (PPoPP’17).New York,NY,USA,2017.
[9]LI B,SUN J,ANNAVARAM M,et al.Elastic-Cache:GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management[C]∥Proceedings of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).Orlando,FL,2017.
[10]KIRK D B,WEN-MEI W H.Programming massively parallel processors:a hands-on approach[M].Newnes,2012.
[11]WU E H,LIU Y Q.The General Computing based on GPU[J].Proceeding of Journal of Computer-Aided Design and Computer Graphics,2004,16(5):601-612.
[12]OWENS J D,HOUSTON M,LUEBKE D.et al.GPU Computing[J].Proceedings of the IEEE,2008,96(5):879-899.
[13]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A survey of general-purpose computation on graphics hardware[J].Computer Graphics Forum:Journal of the European Association for Computer Graphics,2007,26(1):80-113.
[14]Compute unified device architecture programming guide[OL].http://www.nvidia.com/object/cuda_home.html.
[15]Nvidia cuda C.programming guide[OL].https://docs.nvidia.com/cuda/cuda-c-programming-guide.
[16]NVIDIA cuda C.Best practices guide[OL].https://docs. nvidia.com/cuda/cuda-c-best-practices-guide.
[17]The opencl 1.2 speciffication[M].Khronos OpenCL Working Group,2012.
[18]STONE J E,GOHARA D,SHI G.OpenCL:A parallel programming standard for heterogeneous computing systems[J].Computing in Science & Engineering,2010,12(1-3):66-73.
[19]GASTER B,HOWES L,KAELI D R,et al.Heterogeneous Computing with OpenCL:Revised OpenCL 1[M].Newnes,2012.
[20]ZOU H,WANG H Q,HUANG Y.GPU Accelerated Rainbow Tables Analysis of MD5 Hash Password[J].Journal of Chongqing University of Technology(Natural Science),2013,27(7):61-66.(in Chinese)
邹航,王华秋,黄勇.基于GPU加速的彩虹表分析MD5哈希密码[J].重庆理工大学学报(自然科学),2013,27(7):61-66.

Related Articles 15

[1]	GUO Zheng-wei, FU Ze-wen, LI Ning, BAI Lan. Study on Acceleration Algorithm for Raw Data Simulation of High Resolution Squint Spotlight SAR [J]. Computer Science, 2022, 49(8): 178-183.
[2]	ZONG Di-di, XIE Yi-wu. Model Medial Axis Generation Method Based on Normal Iteration [J]. Computer Science, 2022, 49(6A): 764-770.
[3]	WANG Jin, LIU Jiang. GPU-based Parallel DILU Preconditioning Technique [J]. Computer Science, 2022, 49(6): 108-118.
[4]	LI Fan, YAN Xing, ZHANG Xiao-yu. Optimization of GPU-based Eigenface Algorithm [J]. Computer Science, 2021, 48(4): 197-204.
[5]	QI Yan-rong, ZHOU Xia-bing, LI Bin, ZHOU Qing-lei. FPGA-based CNN Image Recognition Acceleration and Optimization [J]. Computer Science, 2021, 48(4): 205-212.
[6]	HU Rong, YANG Wang-dong, WANG Hao-tian, LUO Hui-zhang, LI Ken-li. Parallel WMD Algorithm Based on GPU Acceleration [J]. Computer Science, 2021, 48(12): 24-28.
[7]	LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng. Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster [J]. Computer Science, 2020, 47(4): 6-12.
[8]	ZUO Xian-yu, ZHANG Zhe, SU Yue-han, LIU Yang, GE Qiang, TIAN Jun-feng. Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model [J]. Computer Science, 2020, 47(4): 25-29.
[9]	KANG Lin-yao, TANG Bing, XIA Yan-min, ZHANG Li. GPU-accelerated Non-negative Matrix Factorization-based Parallel Collaborative Filtering Recommendation Algorithm [J]. Computer Science, 2019, 46(8): 106-110.
[10]	LU Xian-hua, WANG Hong-jun. Design of Distributed News Clustering System Based on Big Data Computing Framework [J]. Computer Science, 2019, 46(11A): 220-223.
[11]	ZHU Chao, WU Su-ping. Parallel Harris Feature Point Detection Algorithm [J]. Computer Science, 2019, 46(11A): 289-293.
[12]	SU Qing-hua, FU Jing-chao, GU Han, ZHANG Shan-shan, LI Yi-fei, JIANG Fang-zhou, BAI Han-lin, ZHAO Di. Parallel Algorithm Design for Assisted Diagnosis of Prostate Cancer [J]. Computer Science, 2019, 46(11A): 524-527.
[13]	LIU Yu-cheng, Richard·DING, ZHANG Ying-chao. Research on Pan-real-time Problem of Medical Detection Based on BPNNs Recognition Algorithm [J]. Computer Science, 2018, 45(6): 301-307.
[14]	LIAO Xing, YUAN Jing-ling and CHEN Min-cheng. Parallel PSO Container Packing Algorithm with Adaptive Weight [J]. Computer Science, 2018, 45(3): 231-234.
[15]	WU Yu, YAN Guang-hui, WANG Ya-fei, MA Qing-qing, LIU Yu-xuan. Parallel CP Tensor Decomposition Algorithm Combining with GPU Technology [J]. Computer Science, 2018, 45(11): 298-303.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

PDF (PC)