计算机科学 ›› 2021, Vol. 48 ›› Issue (12): 29-35.doi: 10.11896/jsjkx.201200135

• 计算机体系结构* 上一篇    下一篇

基于SIMD的三角函数高性能实现与优化

姚建宇1,2, 张祎维3, 张广婷1, 贾海鹏1   

  1. 1 中国科学院计算技术研究所计算机体系结构国家重点实验室 北京100190
    2 中国科学院大学计算机与控制学院 北京100049
    3 清华大学计算机科学与技术系 北京100084
  • 收稿日期:2020-12-15 修回日期:2021-04-21 出版日期:2021-12-15 发布日期:2021-11-26
  • 通讯作者: 张广婷(zhangguangting@ict.ac.cn)
  • 作者简介:yaojianyu19f@ict.ac.cn
  • 基金资助:
    国家重点研发计划(2017YFB0202502,2018YFC0809306,2017YFB0202105,2016YFB0200803,2017YFB0202302);国家自然科学基金(61972376);北京自然科学基金(L182053)

High Performance Implementation and Optimization of Trigonometric Functions Based on SIMD

YAO Jian-yu1,2, ZHANG Yi-wei3, ZHANG Guang-ting1, JIA Hai-peng1   

  1. 1 State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
    2 School of Computer and Control Engineering,University of Chinese Academy of Sciences,Beijing 100049,China
    3 Department of Computer Science and Technology,Tsinghua University,Beijing 100084,China
  • Received:2020-12-15 Revised:2021-04-21 Online:2021-12-15 Published:2021-11-26
  • About author:YAO Jian-yu,born in 1997,postgra-duate.His main research interests include high performance computing and parallel software,etc.
    ZHANG Guang-ting,born in 1987,MS,assistant professor,is a member of China Computer Federation.Her main research interests include high perfor-mance computing and parallel software.
  • Supported by:
    National Key R & D Program of China(2017YFB0202502,2018YFC0809306,2017YFB0202105,2016YFB0200803,2017YFB0202302),National Natural Science Foundation of China(61972376) and Beijing Natural Science Foundation of China(L182053).

摘要: 作为基本的数学运算,三角函数的高性能实现对构建处理器的基础软件生态具有重要意义,特别是当前处理器都采用了SIMD架构,基于SIMD实现高性能三角函数具有重要的研究意义和应用价值。对此,文中采用数值分析的方法,对5个常用的三角函数sin,cos,tan,atan,atan2进行了高性能的实现与优化。首先通过分析浮点数IEEE754标准,设计了高效的三角函数算法;然后通过多项式逼近算法中的泰勒公式、帕德近似及雷米兹算法提升了算法精度;最后利用指令流水线与SIMD优化进一步提升了算法性能。实验结果表明,在满足精度的前提下,所实现的三角函数,相较于libm算法库和ARM_M 算法库,在ARM V8计算平台上都获得了较大的性能提升,其中相比libm算法库有1.77~6.26倍的时间性能提升,相比ARM_M算法库有1.34~1.5倍的时间性能提升。

关键词: 三角函数, SIMD, 高性能, 数值分析, ARM V8架构

Abstract: As a basic mathematical operation,the high-performance implementation of trigonometric functions is of great significance to the construction of the basic software ecology of the processor.Especially,the current processors have adopted the SIMD architecture,and the implementation of high-performance trigonometric functions based on SIMD has important research significance and application value.In this regard,this paper uses numerical analysis method to implement and optimize the five commonly used trigonometric functions sin,cos,tan,atan,atan2 with high performance.Based on the analysis of floating-point IEEE754 standard,an efficient trigonometric function algorithm is designed.Then,the algorithm accuracy is further improved by the application of Taylor formula,Pade approximation and Remez algorithm in polynomial approximation algorithm.Finally,the perfor-mance of the algorithm is further improved by using instruction pipeline and SIMD optimization.The experimental results show that,on the premise of satisfying the accuracy,the trigonometric function implemented is compared with libm algorithm library and ARM_M algorithm library,on the ARM V8 computing platform,has achieved great performance improvement,whose time performance is 1.77~6.26 times higher than libm algorithm library,and compared with ARM_M,its times performance is 1.34~1.5 times higher.

Key words: Trigonometric function, SIMD, High performance, Numerical analysis, ARM V8 architecture

中图分类号: 

  • TP391
[1]FU S Y,WU J J,HSU W C.Improving SIMD code generation in QEMU[C]//2015 Design,Automation & Test in Europe Conference & Exhibition(DATE).IEEE,2015:1233-1236.
[2]SHIBATA N.Efficient evaluation methods of elementary functions suitable for SIMD computation[J].Computer Science-Research and Development,2010,25(1):25-32.
[3]STEPHENS N,BILES S,BOETTCHER M,et al.The ARM scalable vector extension[J].IEEE Micro,2017,37(2):26-39.
[4]CHEN S M,GUO S Z,CHEN J X,et al.Optimization Algorithm for Trigonometric Functions Based on Processor with SIMD Function Components[J].Journal of Information Engineering University,2011,12(1):103-106.
[5]CAO D,GUO S Z,ZHANG X.Implementation and Optimization of Extended Function Library Based on SW26010 Processor[J].Computer Engineering,2017(1):61-66.
[6]LI Q Y,WANG N C,YI D Y.Numerical analysis 5th edition[M].Tsinghua University Press,2008.
[7]HACKBUSCH W.Computation of best $$ L{\infty} $$ L∞ exponential sums for 1/x by Remez' algorithm[J].Computing and Visualization in Science,2019,20(1):1-11.
[8]GLUZMAN S,YUKALOV V I.Self-similarly corrected Pade approximants for nonlinear equations[J].International Journal of Modern Physics B,2019,33(29):1950353.
[9]HE Z,ZHANG J,YAO Z.Determining the optimal coefficients of the explicit finite difference scheme using the Remez exchange algorithm[J].Geophysics,2019,84(3):S137-S147.
[10]MACCHIARELLA G.“Equi-Ripple” Synthesis of Multiband Prototype Filters Using a Remez-Like Algorithm[J].IEEE Microwave and Wireless Components Letters,2013,23(5):231-233.
[11]VINSCHEN C,JOHNSTON J.Standard C math library[OL].https://www.sourceware.org/newlib/libm.html.
[12]ARM.Arm Performance Libraries Reference Guide[OL]. https://static.docs.arm.com/101004/1920/arm_performance_libraries_reference_101004_1920_00_en.pdf.
[13]ARM.Arm Optimized Routines[OL].https://github.com/ARM-software/optimized-routines.
[14]ZHA Y L.Qin Jiushao's mathematical thinking method[J].Research on Dialectics of Nature,2003,19(1):87-92.
[1] 蒋化南, 张帅, 林宇斐, 李豪. 基于MPI的分布式并行Gazebo仿真优化与测试[J]. 计算机科学, 2021, 48(11A): 672-677.
[2] 李爽, 赵荣彩, 王磊. 面向申威1621通用矩阵乘算法的实现与优化[J]. 计算机科学, 2021, 48(11A): 699-704.
[3] 陈国良, 张玉杰. 并行计算学科发展历程[J]. 计算机科学, 2020, 47(8): 1-4.
[4] 蔡于涵,熊淑华,孙伟恒,Karn Pradeep,何小海. 基于运动矢量细化的帧率上变换与HEVC结合的视频压缩算法[J]. 计算机科学, 2020, 47(2): 76-82.
[5] 汪洋, 李鹏, 季一木, 樊卫北, 张玉杰, 王汝传, 陈国良. 高性能计算与天文大数据研究综述[J]. 计算机科学, 2020, 47(1): 1-6.
[6] 徐传福,王曦,刘舒,陈世钊,林玉. 基于Python的大规模高性能LBM多相流模拟[J]. 计算机科学, 2020, 47(1): 17-23.
[7] 龚彤艳,张广婷,贾海鹏,袁良. 一种偶数基Cooley-Tukey FFT高性能实现方法[J]. 计算机科学, 2020, 47(1): 31-39.
[8] 颜辉, 朱伯靖, 万文, 钟英, DavidAYune. 基于超算暨HPIC-LBM的大时空尺度三维湍流磁重联[J]. 计算机科学, 2019, 46(8): 89-94.
[9] 贾迅, 钱磊, 邬贵明, 吴东, 谢向辉. FPGA应用于高性能计算的研究现状和未来挑战[J]. 计算机科学, 2019, 46(11): 11-19.
[10] 张云泉. 2018年中国高性能计算机发展现状分析与展望[J]. 计算机科学, 2019, 46(1): 1-5.
[11] 周蓓, 黄永忠, 许瑾晨, 郭绍忠. 向量数学库的向量化方法研究[J]. 计算机科学, 2019, 46(1): 320-324.
[12] 单娜娜,周巍,段哲民. HEVC中的变换系数熵编码优化算法[J]. 计算机科学, 2017, 44(6): 290-293.
[13] 金星彤,李鹏,王刚,刘晓光,李忠伟. 基于异或的隐私保护码优化研究[J]. 计算机科学, 2017, 44(6): 36-42.
[14] 杨玉星,邱亚娜. k元n维冒泡排序网络的子网排除[J]. 计算机科学, 2017, 44(11): 264-267.
[15] 司雨濛,韦建文,Simon SEE,林新华. 星系分组算法的并行设计与优化:SGI系统与分布式集群对比[J]. 计算机科学, 2017, 44(10): 80-84.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 侯晓媛,王显荣,李华,沈维维. U2TP到TTCN-3自动转换的研究与实现[J]. 计算机科学, 2014, 41(Z6): 433 -437 .
[2] 徐昕,顾云丽,钱焕延. 基于链路质量的无线传感器网络任播路由协议[J]. 计算机科学, 2010, 37(4): 99 .
[3] 蔡震震, 唐鹏, 胡建斌, 金炜东. 深度卷积神经网络实现硬性渗出的自动检测[J]. 计算机科学, 2018, 45(11A): 203 -207 .
[4] 田维维, 周悦, 尹旺, 何凌, 邓丽华, 李元媛. 结合EHHT和CI的精神分裂症语音自动检测算法[J]. 计算机科学, 2020, 47(6A): 187 -195 .
[5] 潘孝勤, 芦天亮, 杜彦辉, 仝鑫. 基于深度学习的语音合成与转换技术综述[J]. 计算机科学, 2021, 48(8): 200 -208 .
[6] 王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究[J]. 计算机科学, 2021, 48(9): 36 -42 .
[7] 余力, 杜启翰, 岳博妍, 向君瑶, 徐冠宇, 冷友方. 基于强化学习的推荐研究综述[J]. 计算机科学, 2021, 48(10): 1 -18 .
[8] 王梓强, 胡晓光, 李晓筱, 杜卓群. 移动机器人全局路径规划算法综述[J]. 计算机科学, 2021, 48(10): 19 -29 .
[9] 高洪皓, 郑子彬, 殷昱煜, 丁勇. 区块链技术专题序言[J]. 计算机科学, 2021, 48(11): 1 -3 .
[10] 毛瀚宇, 聂铁铮, 申德荣, 于戈, 徐石成, 何光宇. 区块链即服务平台关键技术及发展综述[J]. 计算机科学, 2021, 48(11): 4 -11 .