Computer Science, 2019, Vol. 46, Issue (4): 321-328. doi: 10.11896/j.issn.1002-137X.2019.04.050

• Interdiscipline & Frontier •

Performance Optimization of FT Program Based on SW26010 Processor

TAO Xiao-han, PANG Jian-min, GAO Wei, WANG Qi, YAO Jin-yang   

  1. Information Engineering University, Zhengzhou 450001, China
    State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
  • Received: 2018-02-11  Online: 2019-04-15  Published: 2019-04-23

Abstract: Sunway TaihuLight is a supercomputer developed independently in China. Its processor, the SW26010, is a heterogeneous many-core processor that was also developed domestically. Each processor contains four core groups, and each core group contains one management processing element (MPE) and 64 computing processing elements (CPEs). The NPB-FT program solves three-dimensional partial differential equations using the fast Fourier transform (FFT), and it is widely used to evaluate the computing and aggregation capabilities of clusters. It is therefore well suited to analyzing the multi-level parallel resources provided by Sunway TaihuLight and the performance of its architecture. First, the program is rewritten as a master-slave version using the accelerated thread library (athread), so that the core of the program is executed by the CPEs. Second, the data transposition step in the FT program is eliminated by using register communication among the CPEs and the data transfer channel between the MPE and the CPEs. Third, computation is overlapped with communication, so that the computing resources within a core do not sit idle during inter-core communication. Finally, vectorization and instruction pipelining are used to increase the program's data-level and instruction-level parallelism. The experimental results show that the 3D-32 program running on a single core achieves a speedup of 66, the 3D-512 program running on 64 cores achieves a speedup of 20, and the 3D-2048 program running on 256 cores achieves a speedup of 46.

Key words: Communication hiding, Fourier transform, Register communication, SW26010 processor

CLC Number: TP301.6
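
The abstract states that NPB-FT is built on a three-dimensional FFT and that one optimization removes a data transposition step. As a rough illustration of where that step comes from, the serial sketch below computes a 3D FFT as three passes of 1D FFTs, one per axis. It is not the authors' SW26010 code: the function names (`fft1d`, `fft3d`, `dft3d_direct`) are hypothetical, the data layout is a plain nested list, and no athread or register-communication calls are modeled. The gather and scatter loops for the y and x axes play the role of the transposition that the paper reports eliminating.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(a):
    """3D FFT of a cubic nested list a[x][y][z], applied one axis at a time."""
    n = len(a)
    a = [[list(row) for row in plane] for plane in a]  # work on a copy
    # z axis: each innermost list is already a contiguous 1D line.
    for x in range(n):
        for y in range(n):
            a[x][y] = fft1d(a[x][y])
    # y axis: elements are strided, so gather a line, transform, scatter back.
    for x in range(n):
        for z in range(n):
            line = fft1d([a[x][y][z] for y in range(n)])
            for y in range(n):
                a[x][y][z] = line[y]
    # x axis: same gather/transform/scatter pattern with a larger stride.
    for y in range(n):
        for z in range(n):
            line = fft1d([a[x][y][z] for x in range(n)])
            for x in range(n):
                a[x][y][z] = line[x]
    return a


def dft3d_direct(a):
    """Direct O(n^6) 3D DFT, used only as a reference for small inputs."""
    n = len(a)
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for u in range(n):
        for v in range(n):
            for w in range(n):
                s = 0j
                for x in range(n):
                    for y in range(n):
                        for z in range(n):
                            phase = -2j * cmath.pi * (u * x + v * y + w * z) / n
                            s += a[x][y][z] * cmath.exp(phase)
                out[u][v][w] = s
    return out
```

On a parallel machine the strided gathers for the second and third axes typically become a global transpose between processing elements, which is why removing or hiding that step matters for performance.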