Computer Science, 2021, Vol. 48, Issue (12): 29-35. doi: 10.11896/jsjkx.201200135

• Computer Architecture •

High Performance Implementation and Optimization of Trigonometric Functions Based on SIMD

YAO Jian-yu1,2, ZHANG Yi-wei3, ZHANG Guang-ting1, JIA Hai-peng1   

  1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received:2020-12-15 Revised:2021-04-21 Online:2021-12-15 Published:2021-11-26
  • About author: YAO Jian-yu, born in 1997, postgraduate. His main research interests include high performance computing and parallel software.
    ZHANG Guang-ting, born in 1987, MS, assistant professor, is a member of China Computer Federation. Her main research interests include high performance computing and parallel software.
  • Supported by:
    National Key R&D Program of China (2017YFB0202502, 2018YFC0809306, 2017YFB0202105, 2016YFB0200803, 2017YFB0202302), National Natural Science Foundation of China (61972376), and Beijing Natural Science Foundation of China (L182053).

Abstract: Trigonometric functions are basic mathematical operations, and their high-performance implementation is an important part of building a processor's basic software ecosystem. Because current processors widely adopt SIMD architectures, implementing high-performance trigonometric functions on SIMD has both research significance and practical value. This paper uses numerical analysis methods to implement and optimize five commonly used trigonometric functions: sin, cos, tan, atan, and atan2. First, an efficient trigonometric function algorithm is designed based on an analysis of the IEEE 754 floating-point standard. Then, the accuracy of the algorithm is further improved by applying the Taylor formula, Padé approximation, and the Remez algorithm to the polynomial approximation. Finally, the performance of the algorithm is further improved through instruction pipelining and SIMD optimization. Experimental results on the ARM V8 computing platform show that, while satisfying the accuracy requirements, the implemented trigonometric functions achieve a speedup of 1.77 to 6.26 times over the libm library and 1.34 to 1.5 times over the ARM_M library.
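The general pipeline named in the abstract (range reduction followed by a polynomial approximation, written so that it can run across SIMD lanes) can be illustrated with a minimal sketch in portable C. This is not the authors' implementation: the function name `sketch_sin_f32`, the two-constant split of pi, and the degree-11 Taylor polynomial are assumptions chosen for illustration, and the paper's Padé or Remez coefficients are not reproduced here. The sketch is only meant for inputs of moderate magnitude (roughly |x| below a few thousand).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants (assumptions, not taken from the paper). */
#define INV_PI 0.31830988618f      /* 1 / pi                                  */
#define PI_HI  3.140625f           /* high part of pi, exactly representable  */
#define PI_LO  9.67653589793e-4f   /* pi - PI_HI                              */

/* Taylor coefficients of sin(r) for the r^3 .. r^11 terms. */
#define S3  (-1.6666667e-1f)
#define S5  ( 8.3333333e-3f)
#define S7  (-1.9841270e-4f)
#define S9  ( 2.7557319e-6f)
#define S11 (-2.5052108e-8f)

/* Hypothetical helper: out[i] ~= sin(in[i]) for moderate |in[i]|.
 * The loop body is branch-light and identical for every element, so a
 * compiler may be able to map it onto SIMD lanes. */
void sketch_sin_f32(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        float x = in[i];

        /* Range reduction: x = k*pi + r with r roughly in [-pi/2, pi/2].
         * Round x/pi to the nearest integer by adding +-0.5 and truncating. */
        int   ki = (int)(x * INV_PI + (x >= 0.0f ? 0.5f : -0.5f));
        float k  = (float)ki;

        /* Subtract k*pi in two pieces to limit cancellation error. */
        float r = x - k * PI_HI;
        r = r - k * PI_LO;

        /* Odd polynomial evaluated with Horner's scheme in r^2. */
        float r2 = r * r;
        float p  = S11;
        p = p * r2 + S9;
        p = p * r2 + S7;
        p = p * r2 + S5;
        p = p * r2 + S3;
        float s = r + r * r2 * p;

        /* sin(k*pi + r) = (-1)^k * sin(r): flip the sign for odd k. */
        out[i] = (ki & 1) ? -s : s;
    }
}
```

A hand-vectorized version would replace the loop body with vector intrinsics, and a production implementation would also need to handle large arguments, NaN, and infinity, which this sketch ignores.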

Key words: ARM V8 architecture, High performance, Numerical analysis, SIMD, Trigonometric function

CLC Number: TP391