计算机科学 ›› 2021, Vol. 48 ›› Issue (6): 26-33.doi: 10.11896/jsjkx.200400007

• 计算机体系结构* • 上一篇    下一篇

基于SIMD扩展部件的长向量超越函数实现方法

刘聃, 郭绍忠, 郝江伟, 许瑾晨   

  1. 信息工程大学数学工程与先进计算国家重点实验室 郑州450002
  • 收稿日期:2020-04-02 修回日期:2020-07-20 出版日期:2021-06-15 发布日期:2021-06-03
  • 通讯作者: 许瑾晨(atao728208@126.com)
  • 基金资助:
    国家自然科学基金项目(61802434)

Implementation of Transcendental Functions on Vectors Based on SIMD Extensions

LIU Dan, GUO Shao-zhong, HAO Jiang-wei, XU Jin-chen   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing,PLA Information Engineering University,Zhengzhou 450002,China
  • Received:2020-04-02 Revised:2020-07-20 Online:2021-06-15 Published:2021-06-03
  • About author:LIU Dan,born in 1988,postgraduate,is a student member of China Computer Federation.His main research interests include high-performance computing and so on.(liudanmath@foxmail.com)
    XU Jin-chen,born in 1987,Ph.D,lectu-rer,is a member of China Computer Fe-deration.His main research interests include high-performance computing and so on.
  • Supported by:
    National Natural Science Foundation of China(61802434).

摘要: 基础数学函数库是计算机系统非常关键的软件模块,然而国产申威平台上的长向量超越函数只能依靠循环调用系统标量函数来间接实现,该方法无法充分发挥申威平台SIMD扩展部件的计算性能。 为了有效解决此问题,实现了申威平台基于SIMD扩展部件底层优化的长向量超越函数,提出了浮点计算融合算法,解决了两分支结构算法难以向量化的问题;提出了基于Estrin算法动态分组的大阶数多项式实现方法,提高了多项式汇编计算的流水性能。 这是在国产申威平台上首次实现长向量超越函数库,提供的函数接口包含三角函数、反三角函数、对数函数、指数函数等。 实验结果表明,双精度版本最大误差控制在3.5ULP(unit in the last place)以下,单精度版本最大误差控制在0.5ULP以下,该性能与申威平台直接循环调用系统标量函数相比有显著提高,平均加速比为3.71。

关键词: 基础数学库, 向量超越函数, 国产平台, 流水优化, 浮点计算

Abstract: The basic mathematical function library is a critical soft module in the computer system.However,the long vector transcendental function on the domestic Shenwei platform can only be implemented indirectly by cyclic utilizing the system scalar function currently,thus limiting the computing capability of the SIMD extensions of Shenwei platform.In order to solve this problem effectively,this paper implements the long vector transcendental function based on lower-level optimization of SIMD extensions of Shenwei platform and proposes the floating-point computing fusion algorithm for solving the problem that the two-branch structure algorithm is difficult to vectorize.It also proposes the implementation method of higher degree polynomials based on the dynamic grouping of Estrin algorithm,which improves the pipelining performance of polynomial assembly evaluation.This is the first time to implement the long vector transcendental function library on the Shenwei platform.The providedfunction interfaces include trigonometric functions,inverse trigonometric functions,logarithmic functions,exponential functions,etc.The experimental result shows that the maximum error of double precision version is controlled below 3.5ULP (unit in the last place),and the maximum error of single precision version is controlled below 0.5ULP.Compared with the scalar function of Shenwei platform,the performance is significantly improved,and the average speedup ratio is 3.71.

Key words: Basic mathematic library, Vector transcendental function, Domestic platform, Pipeline optimization, Floating point calculation

中图分类号: 

  • TP311
[1] Intel Corporation.Intel© Math Kernel Library[OL].[2019-11-01].https://software.intel.com/en-us/mkl.
[2] Nara Institute of Science and Technology.SLEEF VectorizedMath Library[OL].[2019-11-01].https://sleef.org.
[3] SHIBATA N,PETROGALLI F.SLEEF:A Portable Vectorized Library of C Standard Mathematical Functions[J].IEEE Transactions on Parallel and Distributed Systems,2019(99):1316-1327.
[4] Free Software Foundation.The GNU C Library (glibc)[OL].[2019-11-01].https://www.gnu.org/software/libc/.
[5] PIPARO D,INNOCENTE V,HAUTHT.Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions[C]//20th International Conference on Computing in High Energy and Nuclear Physics.Bristol,United Kingdom:IOP,2013:20-27.
[6] LAUTER C.A new open-source SIMD vector libm fully implemented with high-level scalar C[C]//2016 50th Asilomar Conference on Signals,Systems and Computers.Piscataway,NJ:IEEE,2016:407-411.
[7] BRUNIE N,DINECHIN F D,KUPRIIANOVAO,et al.CodeGenerators for Mathematical Functions[C]//2015 IEEE 22nd Symposium on Computer Arithmetic (ARITH).Piscataway,NJ:IEEE,2015:66-73.
[8] LOW J Y L,JONG C C.A Memory-Efficient Tables-and-Additions Method for Accurate Computation of Elementary Functions[J].IEEE Transactions on Computers,2013,62(5):858-872.
[9] LASSUS H D,DEFOUR D,REVY G.Exact Lookup Tables for the Evaluation of Trigonometric and Hyperbolic Functions[J].IEEE Transactions on Computers,2017(99):1-14.
[10] EYERMAN S,SMITH J E,EECKHOUT L.Characterizing the branch misprediction penalty[C]//IEEE International Sympo-sium on Performance Analysis of Systems and Software.Pisca-taway,NJ:IEEE,2006:48-58.
[11] MULLER J M.Elementary functions:algorithms and implementation[M].Boston:birkhauser,2016:84-85.
[12] CHEVILLARD S,MIOARA J,LAUTER C.Sollya:An Environment for the Development of Numerical Codes[C]//International Congress on Mathematical Software.Berlin:Springer,2010:28-31.
[13] American National Standards Institute and Institute of Electrical and Electronic Engineers.IEEE Standard 754-2008[S].New York:American National Standards Institute,2008.
[14] Sun Microsystems,Inc.Freely Distributable LIBM[OL].[2019-11-01].http://www.netlib.org/fdlibm.
[15] TANG P P T.Table-driven implementation of the Expm1 function in IEEE floating-point arithmetic[J].ACM Transactions on Mathematical Software,1992,18(2):211-222.
[16] MICHEL M J,BRISEBARRE N,DED F.Handbook of floating-point arithmetic[M].Boston:Birkhauser,2010.
[17] SHUMAN D I,VANDERGHEYNST P,FROSSARD P.Chebyshev polynomial approximation for distributed signal processing[C]//2011 International Conference on Distributed Computing in Sensor Systems and Workshops (DCOSS).Piscataway,NJ:IEEE,2011:1-8.
[18] TAWFIK S A,FAHMY H A H.Algorithmic truncation ofMiniMax polynomial coefficients[C]//The IEEE International Symposium on Circuits and Systems.Piscataway,NJ:IEEE,2006:2421-2424.
[19] BRISEBARRE N,CHEVILLARD S.Efficient polynomial L-approximations[C]//IEEE Symposium on Computer Arithmetic.Piscataway,NJ:IEEE,2007:169-176.
[20] CHEVILLARD S,LAUTER C.A certified infinite norm for the implementation of elementary functions[C]//International Conference on Quality Software.Piscataway,NJ:IEEE,2007:152-160.
[21] RICARDO P,TREFETHEN L N.Barycentric-Remez algo-rithms for best polynomial approximation in the chebfun system[J].Bit Numerical Mathematics,2009,49(4):721-741.
[22] QIH Y,XU J C,GUO S Z.Detection of the maximum error of mathematical functions[J].The Journal of Supercomputing,2018,74(11):6275-6290.
[1] 曹浩, 郭绍忠, 刘聃, 许瑾晨. 面向64位RISC-V的基础数学库自动化移植[J]. 计算机科学, 2021, 48(6): 41-47.
[2] 胡浩, 沈莉, 周清雷, 巩令钦. 基于LLVM编译器的节点融合优化方法[J]. 计算机科学, 2020, 47(6A): 561-566.
[3] 陈莉 张兆庆 冯晓兵. 分布内存系统中节点间软流水优化技术[J]. 计算机科学, 2002, 29(11): 24-28.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75 .
[2] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[3] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[4] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[5] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99 .
[6] 周燕萍,业巧林. 基于L1-范数距离的最小二乘对支持向量机[J]. 计算机科学, 2018, 45(4): 100 -105 .
[7] 耿海军,施新刚,王之梁,尹霞,尹少平. 基于有向无环图的互联网域内节能路由算法[J]. 计算机科学, 2018, 45(4): 112 -116 .
[8] 崔琼,李建华,王宏,南明莉. 基于节点修复的网络化指挥信息系统弹性分析模型[J]. 计算机科学, 2018, 45(4): 117 -121 .
[9] 王振朝,侯欢欢,连蕊. 抑制CMT中乱序程度的路径优化方案[J]. 计算机科学, 2018, 45(4): 122 -125 .
[10] 施超,谢在鹏,柳晗,吕鑫. 基于稳定匹配的容器部署策略的优化[J]. 计算机科学, 2018, 45(4): 131 -136 .