Computer Science ›› 2015, Vol. 42 ›› Issue (11): 37-42. doi: 10.11896/j.issn.1002-137X.2015.11.006

• 2014 National Annual Conference on High Performance Computing •

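Node-level Memory Access Optimization on Intel Knights Corner

LIN Xin-hua, LI Shuo, ZHAO Jia-ming and MATSUOKA Satoshi

  1. High Performance Computing Center, Shanghai Jiao Tong University, Shanghai 200240; Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo 152-8550; Software and Services Group, Intel Corporation, Portland 999039; High Performance Computing Center, Shanghai Jiao Tong University, Shanghai 200240; Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo 152-8550
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National High Technology Research and Development Program of China (863 Program), under the project "Research on Key Technologies for Application Service Optimization in High Performance Computing Environments", and by the Japan Society for the Promotion of Science RONPAKU Fellowship.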

Abstract: Traditional programming optimization (TPO) has limited effect on Intel Knights Corner (KNC). We therefore propose memory access optimization (MAO) for KNC. Applying MAO to a version of Diffusion 3D that had already undergone TPO still improved its performance by 39.1%. This paper makes two contributions: 1) we propose MAO and argue that it is indispensable on KNC, with TPO+MAO being the path to Ninja performance, that is, the best optimized performance; 2) we find that, for stencil code, intrinsic-based MAO is more efficient than compiler-based MAO. These findings should inform the optimization of large-scale applications on KNC.

Key words: Traditional programming optimization (TPO), Intel Knights Corner (KNC), Memory access optimization (MAO), Ninja performance

