Computer Science ›› 2017, Vol. 44 ›› Issue (1): 20-24. doi: 10.11896/j.issn.1002-137X.2017.01.004
林新华,秦强,李硕,文敏华,松岗聪
LIN Xin-hua, QIN Qiang, LI Shuo, WEN Min-hua and MATSUOKA Satoshi
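Abstract: To make loading non-contiguous data easier during vectorization, Intel introduced the AVX2 vgather instructions on Haswell CPUs. Because stencil codes use conditional checks when setting boundary conditions, the compiler generates vgather instructions, which degrades stencil performance on Haswell. We propose using peel optimization or intrinsic loads to avoid generating vgather instructions, and apply this approach to three stencil benchmarks, the long-range stencil program 3DFD, and the hybrid stencil application 3DEW. On Haswell, these stencils achieve speedups ranging from 1.22X to 3.88X. By studying how the instruction is implemented, we find that a vgather instruction is decoded into multiple micro-operations (μops), with one μop generated for each element to be loaded. The high overhead of decoding vgather makes it the performance bottleneck for stencils on Haswell. Understanding the implementation of AVX2 vgather and knowing how to avoid generating it provide a useful reference for tuning applications with good spatial locality on Haswell.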
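The peel optimization mentioned in the abstract can be illustrated with a minimal sketch. The 3-point averaging stencil, the clamped boundary rule, and the function names `stencil_branchy` and `stencil_peeled` are assumptions chosen for illustration; they are not taken from the paper's benchmarks. The first version tests the boundary inside the loop, so the neighbor index depends on a condition, which is the kind of pattern that may lead a vectorizing compiler to emit a gather. The second version peels the two boundary iterations out of the loop, leaving an interior loop with only contiguous accesses. Whether a gather is actually emitted depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline (hypothetical example): the boundary check lives inside the loop,
 * so the neighbor indices are data-dependent on a condition. */
void stencil_branchy(const float *in, float *out, size_t n)
{
    assert(n >= 2);
    for (size_t i = 0; i < n; ++i) {
        size_t left  = (i == 0)     ? i : i - 1;  /* clamp at the left edge  */
        size_t right = (i == n - 1) ? i : i + 1;  /* clamp at the right edge */
        out[i] = (in[left] + in[i] + in[right]) / 3.0f;
    }
}

/* Peeled version: the first and last iterations are handled outside the loop,
 * so the interior loop reads in[i-1], in[i], in[i+1] with unit stride and
 * contains no conditional. */
void stencil_peeled(const float *in, float *out, size_t n)
{
    assert(n >= 2);
    out[0] = (in[0] + in[0] + in[1]) / 3.0f;              /* left boundary  */
    for (size_t i = 1; i + 1 < n; ++i) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;  /* interior       */
    }
    out[n - 1] = (in[n - 2] + in[n - 1] + in[n - 1]) / 3.0f; /* right boundary */
}
```

Both functions compute the same result; only the loop structure differs. The intrinsic-load alternative named in the abstract would instead replace the interior loop body with explicit contiguous vector loads, which is not shown here because it requires AVX-specific compiler flags.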