计算机科学 ›› 2022, Vol. 49 ›› Issue (10): 27-35.doi: 10.11896/jsjkx.220100125

• 高性能计算* 上一篇    下一篇

面向飞腾处理器的多线程可复现DGEMV设计与实现

陈磊, 唐滔, 漆海俊, 姜浩, 何康   

  1. 国防科技大学计算机学院 长沙 410073
  • 收稿日期:2022-01-14 修回日期:2022-05-03 出版日期:2022-10-15 发布日期:2022-10-13
  • 通讯作者: 漆海俊(18321661390@163.com)
  • 作者简介:(gorgeousboy_cl@163.com)
  • 基金资助:
    国家重点研发计划(2020YFA0709803);173项目(2020-JCJQ-ZD-029);科学挑战专题资助项目(TZ2016002)

Design and Implementation of Multithreaded Reproducible DGEMV for Phytium Processor

CHEN Lei, TANG Tao, QI Hai-jun, JIANG Hao, HE Kang   

  1. College of Computer Science and Technology,National University of Defense Technology,Changsha 410073,China
  • Received:2022-01-14 Revised:2022-05-03 Online:2022-10-15 Published:2022-10-13
  • About author:CHEN Lei,born in 1992,master,assistant engineer.His main research intere-sts include high performance computation and data mining.
    QI Hai-jun,born in 1998,bachaler,assistant engineer.His main research interests include high performance computation and so on.
  • Supported by:
    National Key Research and Development Program of China(2020YFA0709803),173 Program(2020-JCJQ-ZD-029) and Science Challenge Project(TZ2016002).

摘要: 在高性能计算中,求解大规模、大尺度、长时程和病态问题过程中舍入误差的累计都可能会使算法的最终数值结果失真。在不同的计算软硬件资源下,每次运行的结果可能不一致,而这些结果是开发者调试程序和正确性检查的重要依据,会对科研工作的顺利进行造成干扰,因此算法数值结果的可复现性变得至关重要。文中面向飞腾处理器,基于OpenBLAS 软件框架,结合美国伯克利国家实验室的Demmel教授团队开发的ReproBLAS软件中提出的可复现的方法与Castado提出的多层分块技术,使用舍入误差分析和无误差变换等技术,设计出了多线程可复现DGEMV的算法。数值实验显示,所提算法实现了数值计算的可复现性,且输出结果与ReproBLAS相同,验证了所提算法的可靠性。同时,所提算法在相同的测试环境下运行速度至少是ReproBLAS实现算法运行速度的2倍。此外,还将所提算法与日本理化研究所Mukunoki提出的OzBLAS中的可复现DGEMV函数进行对比,同为单线程时该算法的运行速度至少是OzBLAS算法的20倍,在相同多线程数量情况下,该算法的运行速度至少是OzBLAS算法的9倍。理论分析和数值实验均表明,该改进算法比国际上现有的可复现数值算法性能更优。

关键词: 可复现性, 舍入误差, 无误差变换, DGEMV

Abstract: In high-performance computing,the accumulation of rounding error in the process of solving the large-scale,long time and ill-conditioned problem will lead to invalidated results.These results are useful for the developers to debug programs and check their correctness.Therefore,the reproducibility of the numerical results of the algorithm becomes very important.Based on the OpenBLAS’s framework,combining with Demmel’s reproducible method in ReproBLAS and multilayer block technology proposed by Castaldo,this paper designs a reproducible algorithm of multithreaded DGEMV for Phytium processor with rounding error analysis and error free transformation.Numerical experiments show that the output of the algorithm is the same as that of the ReproBLAS,which verifies the reproducibility.Our algorithm is up to 2x faster than that in ReproBLAS.Compared with the DGEMV function of OzBLAS proposed by Mukunoki,our algorithm runs at least 20x faster than that in OzBLAS with single thread,and 9x faster than that in OzBLAS with multi-threads.Theoretical analysis and numerical experiments illustrate that improved algorithm is accurate,validated and efficiency.

Key words: Reproducibility, Round-off error, Error-free transformation, DGEMV

中图分类号: 

  • TP302
[1]TODD R,GENNADY F.Predicts 2015:Introduction to Conditional Numerical Reproducibility(CNR)[OL].(2012-09-05) [2019-11-03].https://software.intel.com/content/www/us/en/develop/articles/introduction-to-the-conditional-numerical-reproducibility-cnr.html.
[2]NVIDIA.NVIDIARcuBLAS[OL].[2021-08-02].https://docs.nvida.com/cuda/cublas/index.html.
[3]PETER A,HONG D N,JAMES D.Efficient ReproducibleFloating Point Summation and BLAS [R].Berkeley:Electrical Engineering and Computer Sciences,2016.
[4]PETER A.Reproducible Parallel Matrix-Vector Multiply [R].Los Alamos:CS 267 Final Report,2015.
[5]JAMES D,HONG D N.Fast reproducible floating-point summation [C]//21st IEEE Symposium on Computer Arithmetic.2013:163-172.
[6]JAMES D,HONG D N.Parallel Reproducible Summation [J].IEEE Transactions on Computer,2015,64(7):2060-2070.
[7]ROMAN L,SYLVAIN C,DAVID D,et al.ExBLAS:Reproducible and Accurate BLAS library[R].Austin,TX:NRE2015 at SC15,2015.
[8]ROMAN I,SYLVAIN C,DAVID D,et al.Reproducible triangular solvers for high-performance computing[C]//17th International Conference on Information Technology-New Generations(ITNG 2020).Las Vegas:Springer,2020:353-358.
[9]SYLVAIN C,DAVID D,STEF G,et al.Numerical Reprodu-cibility for the Parallel Reduction on Multi-and Many-Core Architectures[J].Parallel Computing,2015,49(4):83-97.
[10]KATSUHISA O,TAKESHI O,SIEGFRIED M P,et al.Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications [J].Numerical Algorithms,2012,59(1):95-118.
[11]RUMP S,TAKESHI O,SHIN'ICHI O.Accurate Floating-Point Summation Part II:Sign,K-Fold Faithful and Rounding to Nearest [J].SIAM Journal on Scientific Computing,2008,31(1):1269-1302.
[12]DAICHI M,TAKESHI O,KATSUHISA O.Accurate and Reproducible BLAS Routines with Ozaki Scheme for Many-core Architectures[C]//Parallel Processing and Applied Mathema-tics.Lublin:Springer,2019:516-527.
[13]ANSI/IEEE standard 754-1985,IEEE Standard for BinaryFloating-point Arithmetic[S].New York:Institute of Electrical and Electronics Engineers,Inc,1985.
[14]KNUTH D E.The Art of Computer Programming:Seminumerical Algorithms(3rd edition) [M].USA:Addison-Wesly,Rea-dingMA,1998.
[15]KAHAN W.Pracniques:further remarks on reducing truncation errors [J].Communications of the ACM,1965,8(1):40-42.
[16]DEKKER T J.A floating-point technique for extending the avai-lable precision [J].Numerische Mathematik 1971,18(3):224-242.
[17]TAKESHI O,RUMP S,SHIN'ICHI O.Accurate sum and dot product [J].SIAM Journal on Scientific Computing,2005,26(6):1955-1988.
[18]ZHANG X,WANG Q,ZHANG Y Q.Model-driven level 3BLAS Performance Optimization on Loongson 3A Processor[C]//2012 IEEE 18th International Conference on Parallel and Distributed Systems(ICPADS).Singapore:IEEE Computer Society,2012:684-691.
[19]DAVID R B.Programming with POSIX Threads [M].United States:Addision-Wesley Professional,1997:33-36.
[20]ZHU Y K,HAYES W B.Correct Rounding and hybrid approach to exact floating-point summation [J].SIAM Journal on Scientific Computing,2009,31(4):2981-3001.
[1] 胡艳羽, 赵龙, 董祥军.
一种用于癌症分类的两阶段深度特征选择提取算法
Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification
计算机科学, 2022, 49(7): 73-78. https://doi.org/10.11896/jsjkx.210500092
[2] 吴功兴, 孙兆洋, 琚春华.
考虑中断风险与模糊定价的闭环供应链网络设计模型
Closed-loop Supply Chain Network Design Model Considering Interruption Risk and Fuzzy Pricing
计算机科学, 2022, 49(7): 220-225. https://doi.org/10.11896/jsjkx.201100084
[3] 傅思清, 黎铁军, 张建民.
面向粒子输运程序加速的体系结构设计
Architecture Design for Particle Transport Code Acceleration
计算机科学, 2022, 49(6): 81-88. https://doi.org/10.11896/jsjkx.210600179
[4] 郑智捷.
消解逻辑悖论建立元知识智能化体系
Meta Knowledge Intelligent Systems on Resolving Logic Paradoxes
计算机科学, 2022, 49(1): 9-16. https://doi.org/10.11896/jsjkx.210700023
[5] 刘炜, 阮敏捷, 佘维, 张志鸿, 田钊.
面向物联网的PBFT优化共识算法
PBFT Optimized Consensus Algorithm for Internet of Things
计算机科学, 2021, 48(11): 151-158. https://doi.org/10.11896/jsjkx.210500038
[6] 崔国楠, 王立松, 康介祥, 高忠杰, 王辉, 尹伟.
结合多目标优化算法的模糊聚类有效性指标及应用
Fuzzy Clustering Validity Index Combined with Multi-objective Optimization Algorithm and Its Application
计算机科学, 2021, 48(10): 197-203. https://doi.org/10.11896/jsjkx.200900061
[7] 傅天豪, 田鸿运, 金煜阳, 杨章, 翟季冬, 武林平, 徐小文.
一种面向构件化并行应用程序的性能骨架分析方法
Performance Skeleton Analysis Method Towards Component-based Parallel Applications
计算机科学, 2021, 48(6): 1-9. https://doi.org/10.11896/jsjkx.201200115
[8] 郭彪, 唐麒, 文智敏, 傅娟, 王玲, 魏急波.
一种面向动态部分可重构片上系统的列表式软硬件划分算法
List-based Software and Hardware Partitioning Algorithm for Dynamic Partial Reconfigurable System-on-Chip
计算机科学, 2021, 48(6): 19-25. https://doi.org/10.11896/jsjkx.200700198
[9] 俞建业, 戚湧, 王宝茁.
基于Spark的车联网分布式组合深度学习入侵检测方法
Distributed Combination Deep Learning Intrusion Detection Method for Internet of Vehicles Based on Spark
计算机科学, 2021, 48(6A): 518-523. https://doi.org/10.11896/jsjkx.200700129
[10] 张航, 唐聃, 蔡红亮.
分布式存储系统中的预测式纠删码研究
Study on Predictive Erasure Codes in Distributed Storage System
计算机科学, 2021, 48(5): 130-139. https://doi.org/10.11896/jsjkx.200300124
[11] 鄂海红, 张田宇, 宋美娜.
基于Web的数据可视化图表渲染优化方法
Web-based Data Visualization Chart Rendering Optimization Method
计算机科学, 2021, 48(3): 119-123. https://doi.org/10.11896/jsjkx.200600038
[12] 王妍, 韩笑, 曾辉, 刘荆欣, 夏长清.
边缘计算环境下服务质量可信的任务迁移节点选择
Task Migration Node Selection with Reliable Service Quality in Edge Computing Environment
计算机科学, 2020, 47(10): 240-246. https://doi.org/10.11896/jsjkx.190900054
[13] 王喆, 唐麒, 王玲, 魏急波.
一种基于模拟退火的动态部分可重构系统划分-调度联合优化算法
Joint Optimization Algorithm for Partition-Scheduling of Dynamic Partial Reconfigurable Systems Based on Simulated Annealing
计算机科学, 2020, 47(8): 26-31. https://doi.org/10.11896/jsjkx.200500110
[14] 王国澎, 杨剑新, 尹飞, 蒋生健.
负载均衡的处理器运算资源分配方法
Computing Resources Allocation with Load Balance in Modern Processor
计算机科学, 2020, 47(8): 41-48. https://doi.org/10.11896/jsjkx.191000148
[15] 庄奕, 杨家海.
限时点到多点跨数据中心传输的多源树调度算法
Multi-source Tree-based Scheduling Algorithm for Deadline-aware P2MP Inter-datacenter Transfers
计算机科学, 2020, 47(7): 213-219. https://doi.org/10.11896/jsjkx.200300069
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!