Implementation and Optimization of Sunway1621 General Matrix Multiplication Algorithm

LI Shuang, ZHAO Rong-cai, WANG Lei

School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China
Research Institute of Front Information Technology, Zhongyuan University of Technology, Zhengzhou 450007, China
Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China

Computer Science, 2021, Vol. 48, Issue (11A): 699-704. doi: 10.11896/jsjkx.201200150

Online: 2021-11-10 Published: 2021-11-12
Corresponding author: WANG Lei (wl1167@163.com)
About the authors (contact: cbcvvv@qq.com):
LI Shuang, born in 1992, postgraduate. Her main research interests include high performance computing.
WANG Lei, born in 1977, master, professor, master's supervisor. His main research interests include high performance computing and the design, optimization, and development of basic mathematical function libraries.

Abstract

Abstract: As the most basic mathematical library in high performance computing (HPC), BLAS plays an important role in numerical computation, artificial intelligence, and other applications on HPC platforms. The level 3 BLAS routine GEMM is the core indicator of the performance of the entire BLAS library. At present, there is no high-performance BLAS library that fully exploits the strengths of the Sunway1621 platform. To address this, GotoBLAS is ported to and optimized for Sunway1621. An implementation that optimizes the core code with SIMD vectorization is proposed. To support the vectorized implementation, several optimizations are applied: data rearrangement, selection of the computation block size, floating-point register allocation, and rewriting of the code with vector instructions. For both SGEMM and DGEMM, the optimal block-size choices in the micro-kernel are compared between a cache-line-based scheme and the vectorized scheme. Experimental results show that, with the best blocking, the optimized single-core SGEMM is on average 52.09 times faster than single-core single-precision GotoBLAS, and the optimized single-core DGEMM is on average 32.75 times faster than single-core double-precision GotoBLAS.

Key words: GEMM, SIMD, Program optimization, Sunway1621, Algorithm implementation
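
The abstract names three ingredients of a GotoBLAS-style GEMM: data rearrangement, block-size selection, and a register-tiled micro-kernel. The following is a minimal pure-Python sketch of how these pieces fit together. It is not the paper's Sunway1621 implementation: the block sizes `MC`, `KC`, `NC`, `MR`, `NR` and the helper names `pack_a`, `pack_b`, `micro_kernel`, and `gemm` are hypothetical, chosen only for illustration, and the innermost loop merely marks where SIMD instructions would be used on real hardware.

```python
# Sketch of a GotoBLAS-style blocked GEMM: C += A * B.
# Block sizes are illustrative assumptions, not the values tuned in the paper.
MC, KC, NC = 8, 8, 8   # cache-level block sizes (hypothetical)
MR, NR = 4, 4          # register-tile size of the micro-kernel (hypothetical)


def pack_a(A, i0, p0, mc, kc):
    """Rearrange an mc x kc block of A into MR-row panels, stored column by column."""
    packed = []
    for i in range(0, mc, MR):
        for p in range(kc):
            for ii in range(MR):
                row = i + ii
                # Rows past the block edge are zero-padded.
                packed.append(A[i0 + row][p0 + p] if row < mc else 0.0)
    return packed


def pack_b(B, p0, j0, kc, nc):
    """Rearrange a kc x nc block of B into NR-column panels, stored row by row."""
    packed = []
    for j in range(0, nc, NR):
        for p in range(kc):
            for jj in range(NR):
                col = j + jj
                # Columns past the block edge are zero-padded.
                packed.append(B[p0 + p][j0 + col] if col < nc else 0.0)
    return packed


def micro_kernel(a_panel, b_panel, kc):
    """Compute an MR x NR tile as kc rank-1 updates held in local accumulators."""
    acc = [[0.0] * NR for _ in range(MR)]   # stands in for the register tile
    for p in range(kc):
        a_col = a_panel[p * MR:(p + 1) * MR]
        b_row = b_panel[p * NR:(p + 1) * NR]
        for ii in range(MR):
            a_val = a_col[ii]
            for jj in range(NR):            # the loop a SIMD unit would vectorize
                acc[ii][jj] += a_val * b_row[jj]
    return acc


def gemm(A, B, C):
    """C += A * B for row-major lists of lists; returns C."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):
            kc = min(KC, k - pc)
            bp = pack_b(B, pc, jc, kc, nc)          # packed once per (jc, pc)
            for ic in range(0, m, MC):
                mc = min(MC, m - ic)
                ap = pack_a(A, ic, pc, mc, kc)
                for jr in range(0, nc, NR):
                    b0 = (jr // NR) * kc * NR
                    b_panel = bp[b0:b0 + kc * NR]
                    for ir in range(0, mc, MR):
                        a0 = (ir // MR) * kc * MR
                        a_panel = ap[a0:a0 + kc * MR]
                        acc = micro_kernel(a_panel, b_panel, kc)
                        # Write back only the part of the tile inside C.
                        for ii in range(min(MR, mc - ir)):
                            for jj in range(min(NR, nc - jr)):
                                C[ic + ir + ii][jc + jr + jj] += acc[ii][jj]
    return C
```

For example, `gemm([[1.0, 2.0]], [[3.0], [4.0]], [[0.0]])` returns `[[11.0]]`. Packing makes the micro-kernel read both operands with unit stride, which is what allows contiguous vector loads. The choice of `MR` and `NR` determines how many accumulators must stay in floating-point registers, which is why block-size selection and register allocation are coupled.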

CLC Number: TP319

Cite this article: LI Shuang, ZHAO Rong-cai, WANG Lei. Implementation and Optimization of Sunway1621 General Matrix Multiplication Algorithm[J]. Computer Science, 2021, 48(11A): 699-704. https://doi.org/10.11896/jsjkx.201200150


