Computer Science, 2021, Vol. 48, Issue (11A): 699-704. doi: 10.11896/jsjkx.201200150

• Interdisciplinary & Applications •

Implementation and Optimization of Sunway1621 General Matrix Multiplication Algorithm

LI Shuang, ZHAO Rong-cai, WANG Lei

  1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China
    Research Institute of Front Information Technology, Zhongyuan University of Technology, Zhengzhou 450007, China
    Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China
  • Online: 2021-11-10 Published: 2021-11-12
  • Corresponding author: WANG Lei (wl1167@163.com)
  • Author contact: cbcvvv@qq.com
  • About author: LI Shuang, born in 1992, postgraduate. Her main research interests include high performance computing.
    WANG Lei, born in 1977, master, professor, master's supervisor. His main research interests include high performance computing and the design, optimization, and development of basic mathematical function libraries.

Abstract: BLAS is the most basic mathematical library in high performance computing (HPC) and plays an important role in numerical computation, artificial intelligence, and other applications on HPC platforms. The Level 3 BLAS routine GEMM is the key indicator of the performance of the entire BLAS library. At present, there is no high-performance BLAS library that fully exploits the Sunway1621 platform. To address this problem, GotoBLAS is ported to and optimized for Sunway1621. An implementation that optimizes the core code with SIMD vectorization is proposed. To support it, several optimizations are applied: data rearrangement, selection of the computation block size, floating-point register allocation, and rewriting the kernel with vector instructions. For both SGEMM and DGEMM, the optimal block selection schemes in the micro-kernel are compared between the cache-line-based version and the vectorized version. Experimental results show that, with the best blocking, the optimized single-core SGEMM achieves an average speedup of 52.09X over single-core single-precision GotoBLAS, and the optimized single-core DGEMM achieves an average speedup of 32.75X over single-core double-precision GotoBLAS.

Key words: Algorithm implementation, GEMM, Program optimization, SIMD, Sunway1621
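The optimizations named in the abstract (data rearrangement, block selection, and keeping accumulators in registers) can be illustrated with a minimal, portable C sketch. This is not the paper's Sunway1621 kernel: the tile sizes `MR` and `NR`, all function names, and the scalar inner loop are assumptions chosen for illustration. A tuned implementation would replace the inner loop with the platform's SIMD multiply-add instructions and would pick tile sizes from the register and cache budget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical micro-tile sizes; the paper's tuned values are not stated here. */
enum { MR = 4, NR = 4 };

/* Data rearrangement: copy an MR x k panel of column-major A into a
 * contiguous buffer so the micro-kernel reads it with unit stride. */
static void pack_a(size_t k, const double *a, size_t lda, double *ap) {
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            *ap++ = a[i + p * lda];
}

/* Copy a k x NR panel of column-major B, one row of the panel at a time. */
static void pack_b(size_t k, const double *b, size_t ldb, double *bp) {
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            *bp++ = b[p + j * ldb];
}

/* Micro-kernel: C(MR x NR) += Ap * Bp. The accumulators live in a small
 * local array, which a compiler can keep in registers; a vectorized
 * kernel would hold each column of acc in SIMD registers instead. */
static void micro_kernel(size_t k, const double *ap, const double *bp,
                         double *c, size_t ldc) {
    double acc[MR][NR] = {{0}};
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                acc[i][j] += ap[p * MR + i] * bp[p * NR + j];
    for (size_t j = 0; j < NR; ++j)
        for (size_t i = 0; i < MR; ++i)
            c[i + j * ldc] += acc[i][j];
}

/* C += A * B for column-major matrices; m and n must be multiples of
 * MR and NR. Returns 0 on success, -1 if a packing buffer cannot be
 * allocated. For brevity the A panel is repacked for every column
 * panel of B; a production kernel would pack each block only once. */
int dgemm_sketch(size_t m, size_t n, size_t k,
                 const double *a, size_t lda,
                 const double *b, size_t ldb,
                 double *c, size_t ldc) {
    assert(m % MR == 0 && n % NR == 0);
    if (k == 0)
        return 0;
    double *ap = malloc(sizeof *ap * MR * k);
    double *bp = malloc(sizeof *bp * NR * k);
    if (!ap || !bp) {
        free(ap);
        free(bp);
        return -1;
    }
    for (size_t j = 0; j < n; j += NR) {
        pack_b(k, b + j * ldb, ldb, bp);
        for (size_t i = 0; i < m; i += MR) {
            pack_a(k, a + i, lda, ap);
            micro_kernel(k, ap, bp, c + i + j * ldc, ldc);
        }
    }
    free(ap);
    free(bp);
    return 0;
}
```

Calling `dgemm_sketch(m, n, k, a, lda, b, ldb, c, ldc)` accumulates the product into `c`, so the caller zeroes `c` first when a plain product is wanted. Packing is what makes the vectorized kernel practical: once both panels are contiguous, each step of the `p` loop touches consecutive memory and maps directly onto vector loads.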

CLC Number:

  • TP319