Computer Science ›› 2022, Vol. 49 ›› Issue (10): 27-35.doi: 10.11896/jsjkx.220100125

• High Perfonnance Computing • Previous Articles     Next Articles

Design and Implementation of Multithreaded Reproducible DGEMV for Phytium Processor

CHEN Lei, TANG Tao, QI Hai-jun, JIANG Hao, HE Kang   

  1. College of Computer Science and Technology,National University of Defense Technology,Changsha 410073,China
  • Received:2022-01-14 Revised:2022-05-03 Online:2022-10-15 Published:2022-10-13
  • About author:CHEN Lei,born in 1992,master,assistant engineer.His main research intere-sts include high performance computation and data mining.
    QI Hai-jun,born in 1998,bachaler,assistant engineer.His main research interests include high performance computation and so on.
  • Supported by:
    National Key Research and Development Program of China(2020YFA0709803),173 Program(2020-JCJQ-ZD-029) and Science Challenge Project(TZ2016002).

Abstract: In high-performance computing,the accumulation of rounding error in the process of solving the large-scale,long time and ill-conditioned problem will lead to invalidated results.These results are useful for the developers to debug programs and check their correctness.Therefore,the reproducibility of the numerical results of the algorithm becomes very important.Based on the OpenBLAS’s framework,combining with Demmel’s reproducible method in ReproBLAS and multilayer block technology proposed by Castaldo,this paper designs a reproducible algorithm of multithreaded DGEMV for Phytium processor with rounding error analysis and error free transformation.Numerical experiments show that the output of the algorithm is the same as that of the ReproBLAS,which verifies the reproducibility.Our algorithm is up to 2x faster than that in ReproBLAS.Compared with the DGEMV function of OzBLAS proposed by Mukunoki,our algorithm runs at least 20x faster than that in OzBLAS with single thread,and 9x faster than that in OzBLAS with multi-threads.Theoretical analysis and numerical experiments illustrate that improved algorithm is accurate,validated and efficiency.

Key words: Reproducibility, Round-off error, Error-free transformation, DGEMV

CLC Number: 

  • TP302
[1]TODD R,GENNADY F.Predicts 2015:Introduction to Conditional Numerical Reproducibility(CNR)[OL].(2012-09-05) [2019-11-03].https://software.intel.com/content/www/us/en/develop/articles/introduction-to-the-conditional-numerical-reproducibility-cnr.html.
[2]NVIDIA.NVIDIARcuBLAS[OL].[2021-08-02].https://docs.nvida.com/cuda/cublas/index.html.
[3]PETER A,HONG D N,JAMES D.Efficient ReproducibleFloating Point Summation and BLAS [R].Berkeley:Electrical Engineering and Computer Sciences,2016.
[4]PETER A.Reproducible Parallel Matrix-Vector Multiply [R].Los Alamos:CS 267 Final Report,2015.
[5]JAMES D,HONG D N.Fast reproducible floating-point summation [C]//21st IEEE Symposium on Computer Arithmetic.2013:163-172.
[6]JAMES D,HONG D N.Parallel Reproducible Summation [J].IEEE Transactions on Computer,2015,64(7):2060-2070.
[7]ROMAN L,SYLVAIN C,DAVID D,et al.ExBLAS:Reproducible and Accurate BLAS library[R].Austin,TX:NRE2015 at SC15,2015.
[8]ROMAN I,SYLVAIN C,DAVID D,et al.Reproducible triangular solvers for high-performance computing[C]//17th International Conference on Information Technology-New Generations(ITNG 2020).Las Vegas:Springer,2020:353-358.
[9]SYLVAIN C,DAVID D,STEF G,et al.Numerical Reprodu-cibility for the Parallel Reduction on Multi-and Many-Core Architectures[J].Parallel Computing,2015,49(4):83-97.
[10]KATSUHISA O,TAKESHI O,SIEGFRIED M P,et al.Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications [J].Numerical Algorithms,2012,59(1):95-118.
[11]RUMP S,TAKESHI O,SHIN'ICHI O.Accurate Floating-Point Summation Part II:Sign,K-Fold Faithful and Rounding to Nearest [J].SIAM Journal on Scientific Computing,2008,31(1):1269-1302.
[12]DAICHI M,TAKESHI O,KATSUHISA O.Accurate and Reproducible BLAS Routines with Ozaki Scheme for Many-core Architectures[C]//Parallel Processing and Applied Mathema-tics.Lublin:Springer,2019:516-527.
[13]ANSI/IEEE standard 754-1985,IEEE Standard for BinaryFloating-point Arithmetic[S].New York:Institute of Electrical and Electronics Engineers,Inc,1985.
[14]KNUTH D E.The Art of Computer Programming:Seminumerical Algorithms(3rd edition) [M].USA:Addison-Wesly,Rea-dingMA,1998.
[15]KAHAN W.Pracniques:further remarks on reducing truncation errors [J].Communications of the ACM,1965,8(1):40-42.
[16]DEKKER T J.A floating-point technique for extending the avai-lable precision [J].Numerische Mathematik 1971,18(3):224-242.
[17]TAKESHI O,RUMP S,SHIN'ICHI O.Accurate sum and dot product [J].SIAM Journal on Scientific Computing,2005,26(6):1955-1988.
[18]ZHANG X,WANG Q,ZHANG Y Q.Model-driven level 3BLAS Performance Optimization on Loongson 3A Processor[C]//2012 IEEE 18th International Conference on Parallel and Distributed Systems(ICPADS).Singapore:IEEE Computer Society,2012:684-691.
[19]DAVID R B.Programming with POSIX Threads [M].United States:Addision-Wesley Professional,1997:33-36.
[20]ZHU Y K,HAYES W B.Correct Rounding and hybrid approach to exact floating-point summation [J].SIAM Journal on Scientific Computing,2009,31(4):2981-3001.
[1] HU Yan-yu, ZHAO Long, DONG Xiang-jun. Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification [J]. Computer Science, 2022, 49(7): 73-78.
[2] WU Gong-xing, Sun Zhao-yang, JU Chun-hua. Closed-loop Supply Chain Network Design Model Considering Interruption Risk and Fuzzy Pricing [J]. Computer Science, 2022, 49(7): 220-225.
[3] FU Si-qing, LI Tie-jun, ZHANG Jian-min. Architecture Design for Particle Transport Code Acceleration [J]. Computer Science, 2022, 49(6): 81-88.
[4] Jeffrey ZHENG. Meta Knowledge Intelligent Systems on Resolving Logic Paradoxes [J]. Computer Science, 2022, 49(1): 9-16.
[5] LIU Wei, RUAN Min-jie, SHE Wei, ZHANG Zhi-hong, TIAN Zhao. PBFT Optimized Consensus Algorithm for Internet of Things [J]. Computer Science, 2021, 48(11): 151-158.
[6] CUI Guo-nan, WANG Li-song, KANG Jie-xiang, GAO Zhong-jie, WANG Hui, YIN Wei. Fuzzy Clustering Validity Index Combined with Multi-objective Optimization Algorithm and Its Application [J]. Computer Science, 2021, 48(10): 197-203.
[7] FU Tian-hao, TIAN Hong-yun, JIN Yu-yang, YANG Zhang, ZHAI Ji-dong, WU Lin-ping, XU Xiao-wen. Performance Skeleton Analysis Method Towards Component-based Parallel Applications [J]. Computer Science, 2021, 48(6): 1-9.
[8] GUO Biao, TANG Qi, WEN Zhi-min, FU Juan, WANG Ling, WEI Ji-bo. List-based Software and Hardware Partitioning Algorithm for Dynamic Partial Reconfigurable System-on-Chip [J]. Computer Science, 2021, 48(6): 19-25.
[9] YU Jian-ye, QI Yong, WANG Bao-zhuo. Distributed Combination Deep Learning Intrusion Detection Method for Internet of Vehicles Based on Spark [J]. Computer Science, 2021, 48(6A): 518-523.
[10] ZHANG Hang, TANG Dan, CAI Hong-liang. Study on Predictive Erasure Codes in Distributed Storage System [J]. Computer Science, 2021, 48(5): 130-139.
[11] E Hai-hong, ZHANG Tian-yu, SONG Mei-na. Web-based Data Visualization Chart Rendering Optimization Method [J]. Computer Science, 2021, 48(3): 119-123.
[12] WANG Yan, HAN Xiao, ZENG Hui, LIU Jing-xin, XIA Chang-qing. Task Migration Node Selection with Reliable Service Quality in Edge Computing Environment [J]. Computer Science, 2020, 47(10): 240-246.
[13] WANG Zhe, TANG Qi, WANG Ling, WEI Ji-bo. Joint Optimization Algorithm for Partition-Scheduling of Dynamic Partial Reconfigurable Systems Based on Simulated Annealing [J]. Computer Science, 2020, 47(8): 26-31.
[14] WANG Guo-peng, YANG Jian-xin, YIN Fei, JIANG Sheng-jian. Computing Resources Allocation with Load Balance in Modern Processor [J]. Computer Science, 2020, 47(8): 41-48.
[15] ZHUANG Yi, YANG Jia-hai. Multi-source Tree-based Scheduling Algorithm for Deadline-aware P2MP Inter-datacenter Transfers [J]. Computer Science, 2020, 47(7): 213-219.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!