Computer Science, 2021, Vol. 48, Issue 11A: 699-704. doi: 10.11896/jsjkx.201200150

• Interdiscipline & Application •

Implementation and Optimization of Sunway1621 General Matrix Multiplication Algorithm

LI Shuang, ZHAO Rong-cai, WANG Lei   

  1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China
    Research Institute of Front Information Technology, Zhongyuan University of Technology, Zhengzhou 450007, China
    Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China
  • Online: 2021-11-10 Published: 2021-11-12
  • About author: LI Shuang, born in 1992, postgraduate. Her main research interests include high performance computing.
    WANG Lei, born in 1977, master, professor, master's supervisor. His main research interests include high performance computing and the design and optimization of basic mathematical function libraries.

Abstract: As the most fundamental library in high performance computing (HPC), BLAS plays an important role in scientific computing, AI, and other applications. The GEMM-based Level 3 BLAS routines determine the performance of the entire BLAS library. At present, no high-performance BLAS library takes full advantage of the Sunway1621 processor. To address this problem, we port GotoBLAS to Sunway1621 and optimize it. This paper presents a kernel optimization algorithm based on SIMD vectorization and applies optimization techniques such as data regrouping, blocking, register allocation, and vector instruction optimization. For the SGEMM and DGEMM micro-kernels, it compares candidate data block sizes under vectorization and cache-based optimization to select the best blocking scheme. Compared with GotoBLAS, our optimizations achieve average speedups of 52.09x for single precision and 32.75x for double precision.
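
The techniques named in the abstract (data regrouping, blocking, and a register-tiled micro-kernel) can be illustrated with a minimal sketch in portable C of the general GotoBLAS-style loop structure. This is not the paper's implementation: the block sizes `MC`, `KC`, `NC`, `MR`, `NR` and all function names are hypothetical values chosen for illustration, and the scalar accumulator tile merely stands in for the SIMD registers that a tuned Sunway1621 kernel would use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes, for illustration only (not the paper's tuned values).
   MC and NC must be multiples of MR and NR so padded panels fit the buffers. */
enum { MC = 64, KC = 64, NC = 64, MR = 4, NR = 4 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop: C += A * B, all matrices row-major. */
static void gemm_naive(size_t m, size_t n, size_t k,
                       const double *A, const double *B, double *C) {
    for (size_t i = 0; i < m; i++)
        for (size_t p = 0; p < k; p++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

/* Data regrouping: copy an mc x kc block of A into contiguous MR-row panels,
   zero-padding the last panel so the micro-kernel needs no edge cases. */
static void pack_A(size_t mc, size_t kc, const double *A, size_t lda, double *Ap) {
    for (size_t ir = 0; ir < mc; ir += MR)
        for (size_t p = 0; p < kc; p++)
            for (size_t i = 0; i < MR; i++)
                *Ap++ = (ir + i < mc) ? A[(ir + i) * lda + p] : 0.0;
}

/* Same idea for a kc x nc block of B, stored as NR-column panels. */
static void pack_B(size_t kc, size_t nc, const double *B, size_t ldb, double *Bp) {
    for (size_t jr = 0; jr < nc; jr += NR)
        for (size_t p = 0; p < kc; p++)
            for (size_t j = 0; j < NR; j++)
                *Bp++ = (jr + j < nc) ? B[p * ldb + jr + j] : 0.0;
}

/* Micro-kernel: accumulate an MR x NR tile in local variables (a stand-in
   for vector registers), then add the valid mr x nr part into C. */
static void micro_kernel(size_t kc, const double *Ap, const double *Bp,
                         double *C, size_t ldc, size_t mr, size_t nr) {
    double acc[MR][NR] = {{0}};
    for (size_t p = 0; p < kc; p++)
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                acc[i][j] += Ap[p * MR + i] * Bp[p * NR + j];
    for (size_t i = 0; i < mr; i++)
        for (size_t j = 0; j < nr; j++)
            C[i * ldc + j] += acc[i][j];
}

/* Blocked GEMM: C += A * B (row-major). The static pack buffers keep the
   sketch short but make this function non-reentrant. */
static void gemm_blocked(size_t m, size_t n, size_t k,
                         const double *A, const double *B, double *C) {
    static double Ap[MC * KC], Bp[KC * NC];
    assert(A && B && C);
    for (size_t jc = 0; jc < n; jc += NC) {
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kc = min_sz(KC, k - pc);
            pack_B(kc, nc, &B[pc * n + jc], n, Bp);
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t mc = min_sz(MC, m - ic);
                pack_A(mc, kc, &A[ic * k + pc], k, Ap);
                for (size_t jr = 0; jr < nc; jr += NR)
                    for (size_t ir = 0; ir < mc; ir += MR)
                        micro_kernel(kc, &Ap[ir * kc], &Bp[jr * kc],
                                     &C[(ic + ir) * n + jc + jr], n,
                                     min_sz(MR, mc - ir), min_sz(NR, nc - jr));
            }
        }
    }
}
```

In a sketch like this, the packed panels are read with unit stride inside the micro-kernel, which is what makes the innermost loops amenable to SIMD loads; choosing the tile and block sizes for a given register file and cache hierarchy is the tuning problem the abstract describes.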

Key words: Algorithm implementation, GEMM, Program optimization, SIMD, Sunway1621

CLC Number: TP319