Computer Science ›› 2017, Vol. 44 ›› Issue (1): 20-24. doi: 10.11896/j.issn.1002-137X.2017.01.004


Evaluating Intel AVX2 Vgather Instructions with Stencils

LIN Xin-hua, QIN Qiang, LI Shuo, WEN Min-hua and MATSUOKA Satoshi   

  • Online: 2018-11-13   Published: 2018-11-13

Abstract: Intel introduced the AVX2 vgather instructions on Haswell CPUs to better support vectorized reads of non-contiguous data. We found that the compiler generates vgather instructions for Stencil codes because of the branches used to define the Stencils' boundary conditions, and that these instructions degrade Stencil performance on Haswell. We propose using peel optimization or intrinsic loads to avoid these vgather instructions. Applying these optimizations to three Stencil benchmarks, the long-range Stencil 3DFD, and a hybrid Stencil application yielded speedups ranging from 1.22X to 3.88X on Haswell. By analyzing the implementation of the instructions, we found that vgather instructions are decoded into multiple micro-operations (μops), with one μop generated for each element to be gathered. Because of this high decoder overhead, the vgather instructions become the performance bottleneck of Stencils on Haswell. We believe that understanding the implementation of the AVX2 vgather instructions, and adopting optimizations that avoid them, is helpful for performance tuning of applications with good spatial locality on Haswell.
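As a rough illustration of the peel optimization mentioned in the abstract, the sketch below shows a 1D three-point stencil written two ways. It is not taken from the paper: the function names `stencil_branchy` and `stencil_peeled`, the clamped boundary condition, and the averaging kernel are all assumptions chosen for brevity. In the first version the boundary condition is a branch inside the loop, so the neighbor indices are computed per element, which is the kind of pattern the abstract says can lead a compiler to emit vgather. In the second version the boundary iterations are peeled out of the loop, leaving an interior loop whose loads are plainly contiguous. Whether a particular compiler actually generates vgather for the first form depends on the compiler version and flags.

```c
#include <stddef.h>

/* Hypothetical example: 3-point average with clamped boundaries.
 * The boundary condition is evaluated inside the loop, so the
 * neighbor indices are data-dependent from the compiler's view. */
void stencil_branchy(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t left  = (i == 0)     ? 0     : i - 1;  /* left boundary  */
        size_t right = (i == n - 1) ? n - 1 : i + 1;  /* right boundary */
        out[i] = (in[left] + in[i] + in[right]) / 3.0f;
    }
}

/* Peeled version: the first and last iterations are handled outside
 * the loop, so the interior loop has no branch and reads in[i-1],
 * in[i], in[i+1] as unit-stride (contiguous) accesses. */
void stencil_peeled(const float *in, float *out, size_t n)
{
    if (n == 0)
        return;
    if (n == 1) {
        out[0] = (in[0] + in[0] + in[0]) / 3.0f;
        return;
    }

    /* Peeled first iteration (left neighbor clamped to index 0). */
    out[0] = (in[0] + in[0] + in[1]) / 3.0f;

    /* Branch-free interior. */
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;

    /* Peeled last iteration (right neighbor clamped to index n-1). */
    out[n - 1] = (in[n - 2] + in[n - 1] + in[n - 1]) / 3.0f;
}
```

Both functions compute the same result; only the loop structure differs. The abstract's other suggestion, intrinsic loads, would instead replace the interior loop body with explicit contiguous vector loads (for example `_mm256_loadu_ps` from `<immintrin.h>`), which is omitted here to keep the sketch portable.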

Key words: AVX2 vgather, Stencil, Performance evaluation

