Parallel Computation Performance Analysis Model Based on GPU

Computer Science, 2014, Vol. 41, Issue (1): 31-38.

WANG Zhuo-wei, CHENG Liang-lun, ZHAO Wu-qing

School of Computer Science, Guangdong University of Technology, Guangzhou 510006

Online: 2018-11-14 Published: 2018-11-14
Funding: This work was supported by the Guangzhou Science and Technology Project (2012Y2-0031), the Postdoctoral Foundation (2013M531825), and the National Natural Science Foundation of China (U1201251).

Abstract

GPU-based parallel computing lacks accurate performance analysis models and targeted performance optimization methods. To address this, we propose a quantitative performance analysis model for GPU parallel computation. The model characterizes three major components of GPU execution: the instruction pipeline, shared memory access, and global memory access. With it, programmers can analyze a parallel program quantitatively, locate its performance bottleneck, and optimize it effectively. We demonstrate the model's practicality by analyzing three representative real-world applications (dense matrix multiplication, a tridiagonal linear system solver, and sparse matrix-vector multiplication), and use the analysis to optimize these algorithms.

Key words: GPU, Quantitative performance model, Instruction pipeline, Shared memory access time, Global memory access time

WANG Zhuo-wei, CHENG Liang-lun, ZHAO Wu-qing. Parallel Computation Performance Analysis Model Based on GPU[J]. Computer Science, 2014, 41(1): 31-38. https://doi.org/

