Performance Portability Analysis of OpenACC on the Intel Knights Corner and NVIDIA Kepler Architectures

doi:10.11896/j.issn.1002-137X.2015.01.017

Summary/Abstract

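Summary: OpenACC is a directive-based parallel programming standard. By adding directives that conform to the standard and compiling with an OpenACC compiler, programmers can port serial code in parallel form to accelerators or coprocessors and obtain the speedup that heterogeneous accelerators provide. Unlike heterogeneous parallel programming technologies such as CUDA and OpenCL, OpenACC aims to free programmers from considering the underlying hardware architecture of the accelerator or coprocessor during porting, which lowers the programming difficulty. It also offers good cross-platform portability, since a single code base can run on different hardware platforms. OpenACC is therefore a parallel programming standard worth studying. Heterogeneous acceleration hardware is becoming increasingly diverse. Tianhe-2, ranked first on the November 2013 Top500 list, uses 48,000 coprocessors built on the Intel Knights Corner architecture. Meanwhile, GPU products based on NVIDIA's recently released Kepler architecture have quickly gained a sizable user base, helped by NVIDIA's years of accumulated presence in the GPU market. For those porting applications without pursuing the limit of performance, balancing application performance against ease of porting is an important concern. OpenACC, which needs only one code base to run on both hardware platforms, meets users' need for ease of porting. Once ease of porting is addressed, how the same application performs on different hardware platforms becomes the question users most want answered. Through experiments and performance models, this paper shows the performance portability of applications ported with OpenACC on Intel Knights Corner and NVIDIA Kepler hardware.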

Keywords: OpenACC, performance portability, high-performance computing

Abstract: OpenACC is a programming standard designed to simplify heterogeneous parallel programming by using directives. Because OpenACC can generate OpenCL and CUDA code, and because the CAPS HMPP compiler supports running OpenCL on Intel Knights Corner, it is attractive to use OpenACC on hardware with different underlying micro-architectures. This paper studies how realistic it is to use a single OpenACC source code for a set of hardware platforms with different underlying micro-architectures. Intel Knights Corner and NVIDIA Kepler products are the targets in the experiment, since they have the latest architectures and similar peak performance. The CAPS OpenACC compiler is used to compile the EPCC OpenACC benchmark suite and the Stream and MaxFlops benchmarks of SHOC to assess the performance. To study performance portability, a roofline model and a relative performance model were built from the experimental data. The results show that specific benchmarks achieve at most 82% of peak performance on Kepler and Knights Corner, but as arithmetic intensity rises, the average performance is approximately 10% of peak. There is also a large performance gap between Intel Knights Corner and NVIDIA Kepler on several benchmarks. This study confirms that the performance portability of OpenACC is related to arithmetic intensity and that a large performance gap still exists between different hardware platforms on specific benchmarks.
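
The roofline model named in the abstract bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The following is a minimal sketch of that bound, not the paper's own model: the function names are hypothetical, and the peak, bandwidth, and measured values in the usage section are placeholders rather than measurements from Knights Corner or Kepler.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    peak_gflops   -- peak floating-point throughput in GFLOP/s
    bandwidth_gbs -- peak memory bandwidth in GB/s
    intensity     -- arithmetic intensity in FLOP per byte moved
    """
    if peak_gflops <= 0 or bandwidth_gbs <= 0 or intensity < 0:
        raise ValueError("peak and bandwidth must be positive, intensity non-negative")
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


def fraction_of_roofline(measured_gflops, peak_gflops, bandwidth_gbs, intensity):
    """Measured performance as a fraction of the roofline bound."""
    return measured_gflops / attainable_gflops(peak_gflops, bandwidth_gbs, intensity)


if __name__ == "__main__":
    # Placeholder device parameters (assumptions, not real hardware data).
    peak, bandwidth = 1000.0, 200.0
    for ai in (0.25, 0.5, 5.0, 10.0):
        bound = attainable_gflops(peak, bandwidth, ai)
        print(f"intensity {ai:5.2f} FLOP/byte -> bound {bound:7.1f} GFLOP/s")
    print("ridge point:", ridge_point(peak, bandwidth), "FLOP/byte")
```

Comparing `fraction_of_roofline` for the same benchmark on two devices gives one simple way to express a relative performance gap, though the paper's own relative performance model may be defined differently.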

Key words: OpenACC, performance portability, high-performance computing

WANG Yi-chao (王一超), QIN Qiang (秦强), SEE Simon (施忠伟), LIN Xin-hua (林新华). Performance Portability Evaluation for OpenACC on Intel Knights Corner and NVIDIA Kepler[J]. Computer Science (计算机科学), 2015, 42(1): 75-78. https://doi.org/10.11896/j.issn.1002-137X.2015.01.017

