Computer Science ›› 2019, Vol. 46 ›› Issue (4): 321-328. doi: 10.11896/j.issn.1002-137X.2019.04.050

• Interdisciplinary and Frontier •

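Performance Optimization of FT Program Based on SW26010 Processor

TAO Xiao-han, PANG Jian-min, GAO Wei, WANG Qi, YAO Jin-yang

  1. Information Engineering University, Zhengzhou 450001, China
     State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
  • Received: 2018-02-11 Online: 2019-04-15 Published: 2019-04-23
  • Corresponding author: PANG Jian-min (born 1964), male, Ph.D., professor, doctoral supervisor. His main research interests are high-performance computing and advanced compilation techniques. E-mail: jianmin_pang@126.com
  • About the authors: TAO Xiao-han (born 1996), male, Ph.D. candidate. His main research interests are high-performance computing and advanced compilation techniques. E-mail: txh_0119@126.com; GAO Wei (born 1988), male, Ph.D. His main research interests are high-performance computing and advanced compilation techniques; WANG Qi (born 1992), male, master's student. His main research interests are high-performance computing and advanced compilation techniques; YAO Jin-yang (born 1992), male, master's student. His main research interests are high-performance computing and advanced compilation techniques.
  • Funding:
    This work was supported by the "High-Performance Computing" key special project of the National Key Research and Development Program of China (2016YFB0200503).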

Abstract: Sunway TaihuLight is a supercomputer developed independently in China. It is built on the SW26010 heterogeneous many-core processor, which was also developed domestically. Each processor contains four core groups, and each core group consists of one management processing element (MPE) and 64 computing processing elements (CPEs). The NPB-FT program solves a three-dimensional partial differential equation using the fast Fourier transform and is widely used to evaluate the computing and aggregation capabilities of clusters. Using the FT program to test the multi-level parallel resources and the architectural performance of Sunway TaihuLight is therefore of considerable value. First, the program is rewritten into a master-slave version with the accelerated thread library (athread), so that the computational kernels run on the CPEs. Second, register communication among the CPEs and the data transfer channel between the MPE and the CPEs are used to eliminate the data transposition step in the FT program. Third, communication is overlapped with computation, so that the computing resources inside a core do not sit idle during inter-core communication. Finally, vectorization and instruction pipelining are applied to improve the program's data-level and instruction-level parallelism. The experimental results show a speedup of 66 for the 3D-32 problem size on a single core, 20 for the 3D-512 problem size on 64 cores, and 46 for the 3D-2048 problem size on 256 cores.
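
As an illustration of why data transposition arises in a three-dimensional FFT at all, the following minimal sketch computes a 3D transform as three passes of 1D FFTs, one per axis. It is not the authors' SW26010 implementation and uses no athread or register-communication calls; the function names `fft1d`, `fft3d`, and `dft3d_point` are hypothetical, and plain Python lists stand in for the benchmark's arrays. The gather and scatter steps along the y and x axes play the role that strided access or an explicit transpose plays in a real code.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(a):
    """3D FFT of a nested list a[x][y][z], done as 1D FFTs along z, y, then x."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    # Pass 1 (z axis): each innermost row is contiguous, so no reordering is needed.
    b = [[fft1d(a[i][j]) for j in range(ny)] for i in range(nx)]
    # Pass 2 (y axis): elements of one pencil are strided, so gather, transform, scatter.
    for i in range(nx):
        for k in range(nz):
            pencil = fft1d([b[i][j][k] for j in range(ny)])
            for j in range(ny):
                b[i][j][k] = pencil[j]
    # Pass 3 (x axis): the stride is even larger; same gather/transform/scatter pattern.
    for j in range(ny):
        for k in range(nz):
            pencil = fft1d([b[i][j][k] for i in range(nx)])
            for i in range(nx):
                b[i][j][k] = pencil[i]
    return b


def dft3d_point(a, u, v, w):
    """Directly evaluate one 3D DFT coefficient; used only as a reference check."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    total = 0j
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                phase = u * i / nx + v * j / ny + w * k / nz
                total += a[i][j][k] * cmath.exp(-2j * cmath.pi * phase)
    return total
```

Calling `fft3d` on a small array whose dimensions are powers of two (for example 2 x 4 x 2) and comparing a few entries against `dft3d_point` confirms that the axis-by-axis passes reproduce the full 3D transform. In a distributed or many-core setting, passes 2 and 3 are where data must be moved between memories or cores, which is the cost the transposition-elimination step in the abstract targets.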

Key words: Communication hiding, Fourier transform, Register communication, SW26010 processor

CLC Number:

  • TP301.6