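Abstract: GPU parallel computing lacks accurate performance analysis models and targeted performance optimization methods. To address this, a quantitative performance analysis model for GPU-based parallel computing is proposed. The model characterizes the performance of the instruction pipeline, shared memory access, and global memory access, so that a parallel program can be analyzed quantitatively, helping programmers locate the runtime bottleneck and optimize performance effectively. In the experiments, performance analysis of three representative real applications (dense matrix multiplication, tridiagonal linear system solving, and sparse matrix-vector multiplication) demonstrates the practicality of the model and leads to effective optimization of the algorithms.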
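The abstract names three modeled components but gives no formulas, so the following is only a minimal illustrative sketch of how a bottleneck could be located from three such estimates. It is not the paper's model. The class names, field names, the throughput-based time estimate, and the "largest component time is the bottleneck" rule are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    """Hypothetical per-kernel work counts (assumed inputs, not from the paper)."""
    instructions: float   # instructions issued by the kernel
    shared_bytes: float   # bytes moved through shared memory
    global_bytes: float   # bytes moved through global memory


@dataclass
class DeviceLimits:
    """Hypothetical peak device throughputs (assumed inputs, not from the paper)."""
    instr_per_sec: float
    shared_bytes_per_sec: float
    global_bytes_per_sec: float


def component_times(kernel: KernelProfile, device: DeviceLimits) -> dict:
    """Estimate the time each component would need if it ran alone at peak rate."""
    return {
        "instruction_pipeline": kernel.instructions / device.instr_per_sec,
        "shared_memory": kernel.shared_bytes / device.shared_bytes_per_sec,
        "global_memory": kernel.global_bytes / device.global_bytes_per_sec,
    }


def bottleneck(kernel: KernelProfile, device: DeviceLimits) -> tuple:
    """Return (component name, estimated seconds) for the slowest component.

    Assumption: the three components overlap perfectly, so the largest
    estimate dominates the run time.
    """
    times = component_times(kernel, device)
    name = max(times, key=times.get)
    return name, times[name]


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    kernel = KernelProfile(instructions=1e9, shared_bytes=4e9, global_bytes=2e9)
    device = DeviceLimits(instr_per_sec=1e12,
                          shared_bytes_per_sec=1e12,
                          global_bytes_per_sec=1e11)
    print(bottleneck(kernel, device))
```

Under this simplified view, a programmer would first optimize the component reported as the bottleneck (for example, by reducing global memory traffic) and then re-evaluate, since the bottleneck can shift to another component after each change.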