计算机科学 ›› 2021, Vol. 48 ›› Issue (6): 26-33.doi: 10.11896/jsjkx.200400007

• 计算机体系结构* • 上一篇    下一篇

基于SIMD扩展部件的长向量超越函数实现方法

刘聃, 郭绍忠, 郝江伟, 许瑾晨   

  1. 信息工程大学数学工程与先进计算国家重点实验室 郑州450002
  • 收稿日期:2020-04-02 修回日期:2020-07-20 出版日期:2021-06-15 发布日期:2021-06-03
  • 通讯作者: 许瑾晨(atao728208@126.com)
  • 基金资助:
    国家自然科学基金项目(61802434)

Implementation of Transcendental Functions on Vectors Based on SIMD Extensions

LIU Dan, GUO Shao-zhong, HAO Jiang-wei, XU Jin-chen   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing,PLA Information Engineering University,Zhengzhou 450002,China
  • Received:2020-04-02 Revised:2020-07-20 Online:2021-06-15 Published:2021-06-03
  • About author:LIU Dan,born in 1988,postgraduate,is a student member of China Computer Federation.His main research interests include high-performance computing and so on.(liudanmath@foxmail.com)
    XU Jin-chen,born in 1987,Ph.D,lectu-rer,is a member of China Computer Fe-deration.His main research interests include high-performance computing and so on.
  • Supported by:
    National Natural Science Foundation of China(61802434).

摘要: 基础数学函数库是计算机系统非常关键的软件模块,然而国产申威平台上的长向量超越函数只能依靠循环调用系统标量函数来间接实现,该方法无法充分发挥申威平台SIMD扩展部件的计算性能。 为了有效解决此问题,实现了申威平台基于SIMD扩展部件底层优化的长向量超越函数,提出了浮点计算融合算法,解决了两分支结构算法难以向量化的问题;提出了基于Estrin算法动态分组的大阶数多项式实现方法,提高了多项式汇编计算的流水性能。 这是在国产申威平台上首次实现长向量超越函数库,提供的函数接口包含三角函数、反三角函数、对数函数、指数函数等。 实验结果表明,双精度版本最大误差控制在3.5ULP(unit in the last place)以下,单精度版本最大误差控制在0.5ULP以下,该性能与申威平台直接循环调用系统标量函数相比有显著提高,平均加速比为3.71。

关键词: 浮点计算, 国产平台, 基础数学库, 流水优化, 向量超越函数

Abstract: The basic mathematical function library is a critical soft module in the computer system.However,the long vector transcendental function on the domestic Shenwei platform can only be implemented indirectly by cyclic utilizing the system scalar function currently,thus limiting the computing capability of the SIMD extensions of Shenwei platform.In order to solve this problem effectively,this paper implements the long vector transcendental function based on lower-level optimization of SIMD extensions of Shenwei platform and proposes the floating-point computing fusion algorithm for solving the problem that the two-branch structure algorithm is difficult to vectorize.It also proposes the implementation method of higher degree polynomials based on the dynamic grouping of Estrin algorithm,which improves the pipelining performance of polynomial assembly evaluation.This is the first time to implement the long vector transcendental function library on the Shenwei platform.The providedfunction interfaces include trigonometric functions,inverse trigonometric functions,logarithmic functions,exponential functions,etc.The experimental result shows that the maximum error of double precision version is controlled below 3.5ULP (unit in the last place),and the maximum error of single precision version is controlled below 0.5ULP.Compared with the scalar function of Shenwei platform,the performance is significantly improved,and the average speedup ratio is 3.71.

Key words: Basic mathematic library, Domestic platform, Floating point calculation, Pipeline optimization, Vector transcendental function

中图分类号: 

  • TP311
[1]Intel Corporation.Intel© Math Kernel Library[OL].[2019-11-01].https://software.intel.com/en-us/mkl.
[2]Nara Institute of Science and Technology.SLEEF VectorizedMath Library[OL].[2019-11-01].https://sleef.org.
[3]SHIBATA N,PETROGALLI F.SLEEF:A Portable Vectorized Library of C Standard Mathematical Functions[J].IEEE Transactions on Parallel and Distributed Systems,2019(99):1316-1327.
[4]Free Software Foundation.The GNU C Library (glibc)[OL].[2019-11-01].https://www.gnu.org/software/libc/.
[5]PIPARO D,INNOCENTE V,HAUTHT.Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions[C]//20th International Conference on Computing in High Energy and Nuclear Physics.Bristol,United Kingdom:IOP,2013:20-27.
[6]LAUTER C.A new open-source SIMD vector libm fully implemented with high-level scalar C[C]//2016 50th Asilomar Conference on Signals,Systems and Computers.Piscataway,NJ:IEEE,2016:407-411.
[7]BRUNIE N,DINECHIN F D,KUPRIIANOVAO,et al.CodeGenerators for Mathematical Functions[C]//2015 IEEE 22nd Symposium on Computer Arithmetic (ARITH).Piscataway,NJ:IEEE,2015:66-73.
[8]LOW J Y L,JONG C C.A Memory-Efficient Tables-and-Additions Method for Accurate Computation of Elementary Functions[J].IEEE Transactions on Computers,2013,62(5):858-872.
[9]LASSUS H D,DEFOUR D,REVY G.Exact Lookup Tables for the Evaluation of Trigonometric and Hyperbolic Functions[J].IEEE Transactions on Computers,2017(99):1-14.
[10]EYERMAN S,SMITH J E,EECKHOUT L.Characterizing the branch misprediction penalty[C]//IEEE International Sympo-sium on Performance Analysis of Systems and Software.Pisca-taway,NJ:IEEE,2006:48-58.
[11]MULLER J M.Elementary functions:algorithms and implementation[M].Boston:birkhauser,2016:84-85.
[12]CHEVILLARD S,MIOARA J,LAUTER C.Sollya:An Environment for the Development of Numerical Codes[C]//International Congress on Mathematical Software.Berlin:Springer,2010:28-31.
[13]American National Standards Institute and Institute of Electrical and Electronic Engineers.IEEE Standard 754-2008[S].New York:American National Standards Institute,2008.
[14]Sun Microsystems,Inc.Freely Distributable LIBM[OL].[2019-11-01].http://www.netlib.org/fdlibm.
[15]TANG P P T.Table-driven implementation of the Expm1 function in IEEE floating-point arithmetic[J].ACM Transactions on Mathematical Software,1992,18(2):211-222.
[16]MICHEL M J,BRISEBARRE N,DED F.Handbook of floating-point arithmetic[M].Boston:Birkhauser,2010.
[17]SHUMAN D I,VANDERGHEYNST P,FROSSARD P.Chebyshev polynomial approximation for distributed signal processing[C]//2011 International Conference on Distributed Computing in Sensor Systems and Workshops (DCOSS).Piscataway,NJ:IEEE,2011:1-8.
[18]TAWFIK S A,FAHMY H A H.Algorithmic truncation ofMiniMax polynomial coefficients[C]//The IEEE International Symposium on Circuits and Systems.Piscataway,NJ:IEEE,2006:2421-2424.
[19]BRISEBARRE N,CHEVILLARD S.Efficient polynomial L-approximations[C]//IEEE Symposium on Computer Arithmetic.Piscataway,NJ:IEEE,2007:169-176.
[20]CHEVILLARD S,LAUTER C.A certified infinite norm for the implementation of elementary functions[C]//International Conference on Quality Software.Piscataway,NJ:IEEE,2007:152-160.
[21]RICARDO P,TREFETHEN L N.Barycentric-Remez algo-rithms for best polynomial approximation in the chebfun system[J].Bit Numerical Mathematics,2009,49(4):721-741.
[22]QIH Y,XU J C,GUO S Z.Detection of the maximum error of mathematical functions[J].The Journal of Supercomputing,2018,74(11):6275-6290.
[1] 曹浩, 郭绍忠, 刘聃, 许瑾晨.
面向64位RISC-V的基础数学库自动化移植
Automatic Porting of Basic Mathematics Library for 64-bit RISC-V
计算机科学, 2021, 48(6): 41-47. https://doi.org/10.11896/jsjkx.201200058
[2] 胡浩, 沈莉, 周清雷, 巩令钦.
基于LLVM编译器的节点融合优化方法
Node Fusion Optimization Method Based on LLVM Compiler
计算机科学, 2020, 47(6A): 561-566. https://doi.org/10.11896/JsJkx.191100017
[3] 陈莉 张兆庆 冯晓兵.
分布内存系统中节点间软流水优化技术

计算机科学, 2002, 29(11): 24-28.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!