Computer Science ›› 2019, Vol. 46 ›› Issue (2): 306-314.doi: 10.11896/j.issn.1002-137X.2019.02.047

• Interdiscipline & Frontier • Previous Articles     Next Articles

Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

ZHANG Yi-ran1, CHEN Long2, AN Xiang-zhe2, YAN Shen-gen3   

  1. Beijing Information Science & Technology University,Beijing 100049,China1
    BGP INC.,China National Petroleum Corporation,Zhuozhou,Hebei 072751,China2
    SenseTime,Shenzhen,Guangdong 518000,China3
  • Received:2018-09-12 Online:2019-02-25 Published:2019-02-25

Abstract: Reduction algorithm has wide application in scientific computing and image processing,and it is one of the basic algorithms of parallel computing.Hence,it is significant to accelerate reduction algorithm.In order to fully exploit the capability of GPU for general-purpose computing under heterogeneous processing platform,this paper proposed a multi-level reduction optimization algorithm including inner-thread reduction,inner-work-group reduction and inter-work-group reduction.Different from the traditional way of reduction algorithm optimization of putting more emphasis on inner-work-group reduction,this paper proved that inner-thread reduction is the true bottleneck of reduction algorithm.The experimental results demonstrate that the performance of proposed reduction algorithm has reached 3.91~15.93 and 2.97~20.24 times speedup respectively in AMD W8000 and NVIDIA Tesla K20M under different sizes of data set,compared with carefully optimized CPU version of OpenCV library.In NVIDIA Tesla K20M,compared with CUDA version and OpenCL version of OpenCV library,the algorithm has reached 2.25~5.97 and 1.25~1.75 times speedup respectively.And compared with OpenCL version of OpenCV library in AMD W8000,the algorithm has reached 1.24~5.15 times speedup.This work not only realizes high performance of reduction algorithm on GPU platform,but also reaches the portability of performance between different GPU computing platforms.

Key words: Reduction algorithm, GPU, Inner threads reduction, OpenCL

CLC Number: 

  • TP391
[1]HARRIS M.Optimizing parallel reduction in CUDA[OL].http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[2]CATANZARO B.Opencl optimization case study:Simple reductions[OL].https://developer.amd.com/article/opencl-optimization-case-study-simple-reductions.
[3]YAN S G,YUNQUAN Z,GUOPING L.Reduction algorithm optimization based on the OpenCL[J].Journal of Software,2011,22(S2):163-171.
[4]OpenCV Library Open Source Project[OL].http://opencv. org.
[5]BRADSKI G,KAEHLER A.Learning OpenCV:Computer vision with the OpenCV library[M].O’Reilly Media,Inc.,2008.
[6]RAWAT P S,RASTELLO F,SUKUMARAN-RAJAM A,et al.Register optimizations for stencils on GPUs[C]∥Procee-dings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).New York,NY,USA,2018.
[7]CECKA C.Low communication FMM-accelerated FFT on GPUs [C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis (SC’17).ACM,New York,NY,USA,2017.
[8]ZHANG X X,TAN G M,XUE S B,et al.Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning[C]∥Proceedings of the 22nd ACM SIGPLAN Sympo-sium on Principles and Practice of Parallel Programming (PPoPP’17).New York,NY,USA,2017.
[9]LI B,SUN J,ANNAVARAM M,et al.Elastic-Cache:GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management[C]∥Proceedings of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).Orlando,FL,2017.
[10]KIRK D B,WEN-MEI W H.Programming massively parallel processors:a hands-on approach[M].Newnes,2012.
[11]WU E H,LIU Y Q.The General Computing based on GPU[J].Proceeding of Journal of Computer-Aided Design and Computer Graphics,2004,16(5):601-612.
[12]OWENS J D,HOUSTON M,LUEBKE D.et al.GPU Computing[J].Proceedings of the IEEE,2008,96(5):879-899.
[13]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A survey of general-purpose computation on graphics hardware[J].Computer Graphics Forum:Journal of the European Association for Computer Graphics,2007,26(1):80-113.
[14]Compute unified device architecture programming guide[OL].http://www.nvidia.com/object/cuda_home.html.
[15]Nvidia cuda C.programming guide[OL].https://docs.nvidia.com/cuda/cuda-c-programming-guide.
[16]NVIDIA cuda C.Best practices guide[OL].https://docs. nvidia.com/cuda/cuda-c-best-practices-guide.
[17]The opencl 1.2 speciffication[M].Khronos OpenCL Working Group,2012.
[18]STONE J E,GOHARA D,SHI G.OpenCL:A parallel programming standard for heterogeneous computing systems[J].Computing in Science & Engineering,2010,12(1-3):66-73.
[19]GASTER B,HOWES L,KAELI D R,et al.Heterogeneous Computing with OpenCL:Revised OpenCL 1[M].Newnes,2012.
[20]ZOU H,WANG H Q,HUANG Y.GPU Accelerated Rainbow Tables Analysis of MD5 Hash Password[J].Journal of Chongqing University of Technology(Natural Science),2013,27(7):61-66.(in Chinese)
邹航,王华秋,黄勇.基于GPU加速的彩虹表分析MD5哈希密码[J].重庆理工大学学报(自然科学),2013,27(7):61-66.
[1] LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng. Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster [J]. Computer Science, 2020, 47(4): 6-12.
[2] ZUO Xian-yu, ZHANG Zhe, SU Yue-han, LIU Yang, GE Qiang, TIAN Jun-feng. Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model [J]. Computer Science, 2020, 47(4): 25-29.
[3] KANG Lin-yao, TANG Bing, XIA Yan-min, ZHANG Li. GPU-accelerated Non-negative Matrix Factorization-based Parallel Collaborative Filtering Recommendation Algorithm [J]. Computer Science, 2019, 46(8): 106-110.
[4] LU Xian-hua, WANG Hong-jun. Design of Distributed News Clustering System Based on Big Data Computing Framework [J]. Computer Science, 2019, 46(11A): 220-223.
[5] ZHU Chao, WU Su-ping. Parallel Harris Feature Point Detection Algorithm [J]. Computer Science, 2019, 46(11A): 289-293.
[6] SU Qing-hua, FU Jing-chao, GU Han, ZHANG Shan-shan, LI Yi-fei, JIANG Fang-zhou, BAI Han-lin, ZHAO Di. Parallel Algorithm Design for Assisted Diagnosis of Prostate Cancer [J]. Computer Science, 2019, 46(11A): 524-527.
[7] LIU Yu-cheng, Richard·DING, ZHANG Ying-chao. Research on Pan-real-time Problem of Medical Detection Based on BPNNs Recognition Algorithm [J]. Computer Science, 2018, 45(6): 301-307.
[8] LIAO Xing, YUAN Jing-ling and CHEN Min-cheng. Parallel PSO Container Packing Algorithm with Adaptive Weight [J]. Computer Science, 2018, 45(3): 231-234.
[9] WU Yu, YAN Guang-hui, WANG Ya-fei, MA Qing-qing, LIU Yu-xuan. Parallel CP Tensor Decomposition Algorithm Combining with GPU Technology [J]. Computer Science, 2018, 45(11): 298-303.
[10] ZHANG Shuai, XU Shun, LIU Qian, JIN Zhong. Cell Verlet Algorithm of Molecular Dynamics Simulation Based on GPU and Its Parallel Performance Analysis [J]. Computer Science, 2018, 45(10): 291-294.
[11] LU Jia-ming and ZHU Zhe. Real-time 4K Panoramic Video Stitching Based on GPU Acceleration [J]. Computer Science, 2017, 44(8): 18-21.
[12] YIN Meng-jia, XU Xian-bin, HE Shui-bing, HU Jing, YE Cong-huan and ZHANG Tao. Performance Model of Sparse Matrix Vector Multiplication on GPU [J]. Computer Science, 2017, 44(4): 182-187.
[13] YUAN Bin. Level-Set GPU Evolution Algorithm Based on B-spline [J]. Computer Science, 2017, 44(3): 59-62.
[14] SHI Xue-kai, WANG Wen-ke, HUANG Hui, LI Si-kun and FU Yi-qi. Volume Rendering Method of Mass Brain Imaging Data Based on Compression Domain [J]. Computer Science, 2017, 44(3): 27-31.
[15] LIU Jin-shuo, JIANG Zhuang-yi, XU Ya-bo, DENG Juan and ZHANG Lan-xin. Multithread and GPU Parallel Schema on Patch-based Multi-view Stereo Algorithm [J]. Computer Science, 2017, 44(2): 296-301.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] . [J]. Computer Science, 2018, 1(1): 1 .
[2] LEI Li-hui and WANG Jing. Parallelization of LTL Model Checking Based on Possibility Measure[J]. Computer Science, 2018, 45(4): 71 -75 .
[3] SUN Qi, JIN Yan, HE Kun and XU Ling-xuan. Hybrid Evolutionary Algorithm for Solving Mixed Capacitated General Routing Problem[J]. Computer Science, 2018, 45(4): 76 -82 .
[4] ZHANG Jia-nan and XIAO Ming-yu. Approximation Algorithm for Weighted Mixed Domination Problem[J]. Computer Science, 2018, 45(4): 83 -88 .
[5] WU Jian-hui, HUANG Zhong-xiang, LI Wu, WU Jian-hui, PENG Xin and ZHANG Sheng. Robustness Optimization of Sequence Decision in Urban Road Construction[J]. Computer Science, 2018, 45(4): 89 -93 .
[6] SHI Wen-jun, WU Ji-gang and LUO Yu-chun. Fast and Efficient Scheduling Algorithms for Mobile Cloud Offloading[J]. Computer Science, 2018, 45(4): 94 -99 .
[7] ZHOU Yan-ping and YE Qiao-lin. L1-norm Distance Based Least Squares Twin Support Vector Machine[J]. Computer Science, 2018, 45(4): 100 -105 .
[8] LIU Bo-yi, TANG Xiang-yan and CHENG Jie-ren. Recognition Method for Corn Borer Based on Templates Matching in Muliple Growth Periods[J]. Computer Science, 2018, 45(4): 106 -111 .
[9] GENG Hai-jun, SHI Xin-gang, WANG Zhi-liang, YIN Xia and YIN Shao-ping. Energy-efficient Intra-domain Routing Algorithm Based on Directed Acyclic Graph[J]. Computer Science, 2018, 45(4): 112 -116 .
[10] CUI Qiong, LI Jian-hua, WANG Hong and NAN Ming-li. Resilience Analysis Model of Networked Command Information System Based on Node Repairability[J]. Computer Science, 2018, 45(4): 117 -121 .