计算机科学 ›› 2021, Vol. 48 ›› Issue (12): 29-35.doi: 10.11896/jsjkx.201200135

• 计算机体系结构* 上一篇    下一篇

基于SIMD的三角函数高性能实现与优化

姚建宇1,2, 张祎维3, 张广婷1, 贾海鹏1   

  1. 1 中国科学院计算技术研究所计算机体系结构国家重点实验室 北京100190
    2 中国科学院大学计算机与控制学院 北京100049
    3 清华大学计算机科学与技术系 北京100084
  • 收稿日期:2020-12-15 修回日期:2021-04-21 出版日期:2021-12-15 发布日期:2021-11-26
  • 通讯作者: 张广婷(zhangguangting@ict.ac.cn)
  • 作者简介:yaojianyu19f@ict.ac.cn
  • 基金资助:
    国家重点研发计划(2017YFB0202502,2018YFC0809306,2017YFB0202105,2016YFB0200803,2017YFB0202302);国家自然科学基金(61972376);北京自然科学基金(L182053)

High Performance Implementation and Optimization of Trigonometric Functions Based on SIMD

YAO Jian-yu1,2, ZHANG Yi-wei3, ZHANG Guang-ting1, JIA Hai-peng1   

  1. 1 State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
    2 School of Computer and Control Engineering,University of Chinese Academy of Sciences,Beijing 100049,China
    3 Department of Computer Science and Technology,Tsinghua University,Beijing 100084,China
  • Received:2020-12-15 Revised:2021-04-21 Online:2021-12-15 Published:2021-11-26
  • About author:YAO Jian-yu,born in 1997,postgra-duate.His main research interests include high performance computing and parallel software,etc.
    ZHANG Guang-ting,born in 1987,MS,assistant professor,is a member of China Computer Federation.Her main research interests include high perfor-mance computing and parallel software.
  • Supported by:
    National Key R & D Program of China(2017YFB0202502,2018YFC0809306,2017YFB0202105,2016YFB0200803,2017YFB0202302),National Natural Science Foundation of China(61972376) and Beijing Natural Science Foundation of China(L182053).

摘要: 作为基本的数学运算,三角函数的高性能实现对构建处理器的基础软件生态具有重要意义,特别是当前处理器都采用了SIMD架构,基于SIMD实现高性能三角函数具有重要的研究意义和应用价值。对此,文中采用数值分析的方法,对5个常用的三角函数sin,cos,tan,atan,atan2进行了高性能的实现与优化。首先通过分析浮点数IEEE754标准,设计了高效的三角函数算法;然后通过多项式逼近算法中的泰勒公式、帕德近似及雷米兹算法提升了算法精度;最后利用指令流水线与SIMD优化进一步提升了算法性能。实验结果表明,在满足精度的前提下,所实现的三角函数,相较于libm算法库和ARM_M 算法库,在ARM V8计算平台上都获得了较大的性能提升,其中相比libm算法库有1.77~6.26倍的时间性能提升,相比ARM_M算法库有1.34~1.5倍的时间性能提升。

关键词: ARM V8架构, SIMD, 高性能, 三角函数, 数值分析

Abstract: As a basic mathematical operation,the high-performance implementation of trigonometric functions is of great significance to the construction of the basic software ecology of the processor.Especially,the current processors have adopted the SIMD architecture,and the implementation of high-performance trigonometric functions based on SIMD has important research significance and application value.In this regard,this paper uses numerical analysis method to implement and optimize the five commonly used trigonometric functions sin,cos,tan,atan,atan2 with high performance.Based on the analysis of floating-point IEEE754 standard,an efficient trigonometric function algorithm is designed.Then,the algorithm accuracy is further improved by the application of Taylor formula,Pade approximation and Remez algorithm in polynomial approximation algorithm.Finally,the perfor-mance of the algorithm is further improved by using instruction pipeline and SIMD optimization.The experimental results show that,on the premise of satisfying the accuracy,the trigonometric function implemented is compared with libm algorithm library and ARM_M algorithm library,on the ARM V8 computing platform,has achieved great performance improvement,whose time performance is 1.77~6.26 times higher than libm algorithm library,and compared with ARM_M,its times performance is 1.34~1.5 times higher.

Key words: ARM V8 architecture, High performance, Numerical analysis, SIMD, Trigonometric function

中图分类号: 

  • TP391
[1]FU S Y,WU J J,HSU W C.Improving SIMD code generation in QEMU[C]//2015 Design,Automation & Test in Europe Conference & Exhibition(DATE).IEEE,2015:1233-1236.
[2]SHIBATA N.Efficient evaluation methods of elementary functions suitable for SIMD computation[J].Computer Science-Research and Development,2010,25(1):25-32.
[3]STEPHENS N,BILES S,BOETTCHER M,et al.The ARM scalable vector extension[J].IEEE Micro,2017,37(2):26-39.
[4]CHEN S M,GUO S Z,CHEN J X,et al.Optimization Algorithm for Trigonometric Functions Based on Processor with SIMD Function Components[J].Journal of Information Engineering University,2011,12(1):103-106.
[5]CAO D,GUO S Z,ZHANG X.Implementation and Optimization of Extended Function Library Based on SW26010 Processor[J].Computer Engineering,2017(1):61-66.
[6]LI Q Y,WANG N C,YI D Y.Numerical analysis 5th edition[M].Tsinghua University Press,2008.
[7]HACKBUSCH W.Computation of best $$ L{\infty} $$ L∞ exponential sums for 1/x by Remez' algorithm[J].Computing and Visualization in Science,2019,20(1):1-11.
[8]GLUZMAN S,YUKALOV V I.Self-similarly corrected Pade approximants for nonlinear equations[J].International Journal of Modern Physics B,2019,33(29):1950353.
[9]HE Z,ZHANG J,YAO Z.Determining the optimal coefficients of the explicit finite difference scheme using the Remez exchange algorithm[J].Geophysics,2019,84(3):S137-S147.
[10]MACCHIARELLA G.“Equi-Ripple” Synthesis of Multiband Prototype Filters Using a Remez-Like Algorithm[J].IEEE Microwave and Wireless Components Letters,2013,23(5):231-233.
[11]VINSCHEN C,JOHNSTON J.Standard C math library[OL].https://www.sourceware.org/newlib/libm.html.
[12]ARM.Arm Performance Libraries Reference Guide[OL]. https://static.docs.arm.com/101004/1920/arm_performance_libraries_reference_101004_1920_00_en.pdf.
[13]ARM.Arm Optimized Routines[OL].https://github.com/ARM-software/optimized-routines.
[14]ZHA Y L.Qin Jiushao's mathematical thinking method[J].Research on Dialectics of Nature,2003,19(1):87-92.
[1] 李浩东, 胡洁, 范勤勤.
基于并行分区搜索的多模态多目标优化及其应用
Multimodal Multi-objective Optimization Based on Parallel Zoning Search and Its Application
计算机科学, 2022, 49(5): 212-220. https://doi.org/10.11896/jsjkx.210300019
[2] 蒋化南, 张帅, 林宇斐, 李豪.
基于MPI的分布式并行Gazebo仿真优化与测试
Simulation Optimization and Testing Based on Gazebo of MPI Distributed Parallelism
计算机科学, 2021, 48(11A): 672-677. https://doi.org/10.11896/jsjkx.210100109
[3] 李爽, 赵荣彩, 王磊.
面向申威1621通用矩阵乘算法的实现与优化
Implementation and Optimization of Sunway1621 General Matrix Multiplication Algorithm
计算机科学, 2021, 48(11A): 699-704. https://doi.org/10.11896/jsjkx.201200150
[4] 陈国良, 张玉杰.
并行计算学科发展历程
Development of Parallel Computing Subject
计算机科学, 2020, 47(8): 1-4. https://doi.org/10.11896/jsjkx.200600027
[5] 蔡于涵,熊淑华,孙伟恒,Karn Pradeep,何小海.
基于运动矢量细化的帧率上变换与HEVC结合的视频压缩算法
Video Compression Algorithm Combining Frame Rate Up-conversion with HEVC Standard Based on Motion Vector Refinement
计算机科学, 2020, 47(2): 76-82. https://doi.org/10.11896/jsjkx.190500092
[6] 汪洋, 李鹏, 季一木, 樊卫北, 张玉杰, 王汝传, 陈国良.
高性能计算与天文大数据研究综述
High Performance Computing and Astronomical Data:A Survey
计算机科学, 2020, 47(1): 1-6. https://doi.org/10.11896/jsjkx.190900042
[7] 徐传福,王曦,刘舒,陈世钊,林玉.
基于Python的大规模高性能LBM多相流模拟
Large-scale High-performance Lattice Boltzmann Multi-phase Flow Simulations Based on Python
计算机科学, 2020, 47(1): 17-23. https://doi.org/10.11896/jsjkx.190500009
[8] 龚彤艳,张广婷,贾海鹏,袁良.
一种偶数基Cooley-Tukey FFT高性能实现方法
High-performance Implementation Method for Even Basis of Cooley-Tukey FFT
计算机科学, 2020, 47(1): 31-39. https://doi.org/10.11896/jsjkx.190900179
[9] 颜辉, 朱伯靖, 万文, 钟英, DavidAYune.
基于超算暨HPIC-LBM的大时空尺度三维湍流磁重联
HPIC-LBM Method Based Simulation of Large Temporal-Spatial Scale 3D Turbulent Magnetic Reconnection on Supercomputer
计算机科学, 2019, 46(8): 89-94. https://doi.org/10.11896/j.issn.1002-137X.2019.08.014
[10] 贾迅, 钱磊, 邬贵明, 吴东, 谢向辉.
FPGA应用于高性能计算的研究现状和未来挑战
Research Advances and Future Challenges of FPGA-based High Performance Computing
计算机科学, 2019, 46(11): 11-19. https://doi.org/10.11896/jsjkx.191100500C
[11] 周蓓, 黄永忠, 许瑾晨, 郭绍忠.
向量数学库的向量化方法研究
Study on SIMD Method of Vector Math Library
计算机科学, 2019, 46(1): 320-324. https://doi.org/10.11896/j.issn.1002-137X.2019.01.050
[12] 张云泉.
2018年中国高性能计算机发展现状分析与展望
State-of-the-art Analysis and Perspectives of 2018 China HPC Development
计算机科学, 2019, 46(1): 1-5. https://doi.org/10.11896/j.issn.1002-137X.2019.01.001
[13] 金星彤,李鹏,王刚,刘晓光,李忠伟.
基于异或的隐私保护码优化研究
Optimizing Small XOR-based Non-systematic Erasure Codes
计算机科学, 2017, 44(6): 36-42. https://doi.org/10.11896/j.issn.1002-137X.2017.06.006
[14] 单娜娜,周巍,段哲民.
HEVC中的变换系数熵编码优化算法
Improved Entropy Coding Algorithm for Transform Coefficients in HEVC
计算机科学, 2017, 44(6): 290-293. https://doi.org/10.11896/j.issn.1002-137X.2017.06.051
[15] 杨玉星,邱亚娜.
k元n维冒泡排序网络的子网排除
Sub-network Preclusion in (n,k)-bubble-sort Networks
计算机科学, 2017, 44(11): 264-267. https://doi.org/10.11896/j.issn.1002-137X.2017.11.039
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!