Computer Science, 2019, Vol. 46, Issue (8): 95-99. doi: 10.11896/j.issn.1002-137X.2019.08.015

• 2018 National Annual Conference on High Performance Computing •

Performance Evaluation of ARM-ISA SoC for High Performance Computing

WANG Yi-chao1, LIAO Qiu-cheng1, ZUO Si-cheng2, XIE Rui1, LIN Xin-hua1

  1. Network & Information Center, Shanghai Jiao Tong University, Shanghai 200240, China
  2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2019-01-20 Online: 2019-08-15 Published: 2019-08-16
  • Corresponding author: LIN Xin-hua (born 1979), male, Ph.D., senior engineer. His main research interests are high performance computing and computer architecture. E-mail: james@sjtu.edu.cn
  • About the authors: WANG Yi-chao (born 1990), male, M.S., engineer. His main research interest is performance optimization in high performance computing. E-mail: wangyichao@sjtu.edu.cn; LIAO Qiu-cheng (born 1994), male, assistant engineer. His main research interest is computer architecture; ZUO Si-cheng (born 1997), male. His main research interest is high performance computing; XIE Rui (born 1974), male, M.S., senior engineer. His main research interest is computer networks
  • Supported by:
    National Key Research and Development Program of China (2018YF0404603)

Abstract: To explore the value of the ARM architecture for high performance computing in energy-efficient "green computing", this paper evaluates an ARM-ISA processor and compares it with a mainstream commercial processor, the Intel Xeon. At the microarchitecture level, the processor's floating-point computing capability, memory bandwidth, and memory latency were measured. The results show that its double-precision floating-point performance is about 475 GFLOPS, about 33% lower than that of the Intel Xeon E5-2680v3 (roughly two thirds of its performance), while its memory bandwidth is about 105 GB/s, higher than that of the Xeon platform. At the application level, four typical high performance computing applications, including ones based on the stencil parallel computing method, were ported to and compiled on the processor and run with thread binding, which improves cache locality and computing performance. The results show that porting applications to the ARM-ISA processor is straightforward and that the optimization approach is similar to that for mainstream commercial processors such as the Intel Xeon. The processor has room for improvement on compute-intensive and random-memory-access applications, while its performance on the stencil applications is close to that of the Xeon. Combined with its low power consumption, this makes it competitive in green computing. Follow-up work will continue this study on the latest ARM-ISA chips.

Key words: ARMv8, Processor, Performance evaluation

CLC Number: TP391
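The abstract reports that the applications were run with thread binding to improve cache locality, but it does not say which tool or policy was used. The following is a minimal sketch of the general idea only, not the authors' method: it computes which core each thread would be pinned to under two common policies and formats the result as an explicit place list. The function names `bind_plan` and `omp_places`, the policy names, and the core counts in the usage note are all hypothetical; `OMP_PLACES` is a standard OpenMP environment variable, used here as one assumed way to apply such a plan.

```python
def bind_plan(num_threads, num_cores, policy="compact"):
    """Return the core id each thread would be pinned to (hypothetical helper).

    "compact" packs threads onto consecutive cores, so neighbouring threads
    are more likely to share a cache; "scatter" spreads them across the cores.
    """
    if num_threads < 1 or num_threads > num_cores:
        raise ValueError("need 1 <= num_threads <= num_cores")
    if policy == "compact":
        return list(range(num_threads))
    if policy == "scatter":
        stride = num_cores // num_threads  # gap between consecutive threads
        return [t * stride for t in range(num_threads)]
    raise ValueError(f"unknown policy: {policy}")


def omp_places(core_ids):
    """Format core ids as an explicit OMP_PLACES list, one place per thread."""
    return ",".join("{%d}" % c for c in core_ids)
```

For example, on an assumed 8-core machine, `omp_places(bind_plan(4, 8, "scatter"))` produces a string that could be exported as `OMP_PLACES` before launching an OpenMP program with thread binding enabled. Which policy helps depends on how the cores share caches and memory controllers, which the abstract does not describe.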