Computer Science ›› 2015, Vol. 42 ›› Issue (11): 37-42. doi: 10.11896/j.issn.1002-137X.2015.11.006

• 2014 National Annual Conference on High Performance Computers

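Node-level Memory Access Optimization on Intel Knights Corner

LIN Xin-hua, LI Shuo, ZHAO Jia-ming and MATSUOKA Satoshi

  1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240; Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo 152-8550; Software and Services Group, Intel Corporation, Portland 999039; Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240; Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo 152-8550
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National High Technology Research and Development Program of China (863 Program), under the project "Research on Key Technologies for Application Service Optimization in High Performance Computing Environments", and by the Japan Society for the Promotion of Science (JSPS) RONPAKU Fellowship.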

Abstract: Traditional programming optimization (TPO) has only a limited effect on Intel Knights Corner (KNC). We therefore propose memory access optimization (MAO) for KNC. Applying MAO to a version of Diffusion 3D that had already undergone TPO improved its performance by a further 39.1%. This paper makes two contributions: 1) it proposes MAO and argues that TPO+MAO is the path to Ninja performance on KNC, that is, the best-optimized performance; 2) it finds that, for stencil code, intrinsic-based MAO is more efficient than compiler-based MAO. These findings may inform the optimization of large-scale applications on KNC.

Key words: Traditional programming optimization (TPO), Intel Knights Corner (KNC), Memory access optimization (MAO), Ninja performance

