计算机科学 ›› 2022, Vol. 49 ›› Issue (10): 27-35.doi: 10.11896/jsjkx.220100125
陈磊, 唐滔, 漆海俊, 姜浩, 何康
CHEN Lei, TANG Tao, QI Hai-jun, JIANG Hao, HE Kang
摘要: 在高性能计算中,求解大规模、大尺度、长时程和病态问题过程中舍入误差的累计都可能会使算法的最终数值结果失真。在不同的计算软硬件资源下,每次运行的结果可能不一致,而这些结果是开发者调试程序和正确性检查的重要依据,会对科研工作的顺利进行造成干扰,因此算法数值结果的可复现性变得至关重要。文中面向飞腾处理器,基于OpenBLAS 软件框架,结合美国伯克利国家实验室的Demmel教授团队开发的ReproBLAS软件中提出的可复现的方法与Castado提出的多层分块技术,使用舍入误差分析和无误差变换等技术,设计出了多线程可复现DGEMV的算法。数值实验显示,所提算法实现了数值计算的可复现性,且输出结果与ReproBLAS相同,验证了所提算法的可靠性。同时,所提算法在相同的测试环境下运行速度至少是ReproBLAS实现算法运行速度的2倍。此外,还将所提算法与日本理化研究所Mukunoki提出的OzBLAS中的可复现DGEMV函数进行对比,同为单线程时该算法的运行速度至少是OzBLAS算法的20倍,在相同多线程数量情况下,该算法的运行速度至少是OzBLAS算法的9倍。理论分析和数值实验均表明,该改进算法比国际上现有的可复现数值算法性能更优。
中图分类号:
[1]TODD R,GENNADY F.Predicts 2015:Introduction to Conditional Numerical Reproducibility(CNR)[OL].(2012-09-05) [2019-11-03].https://software.intel.com/content/www/us/en/develop/articles/introduction-to-the-conditional-numerical-reproducibility-cnr.html. [2]NVIDIA.NVIDIARcuBLAS[OL].[2021-08-02].https://docs.nvida.com/cuda/cublas/index.html. [3]PETER A,HONG D N,JAMES D.Efficient ReproducibleFloating Point Summation and BLAS [R].Berkeley:Electrical Engineering and Computer Sciences,2016. [4]PETER A.Reproducible Parallel Matrix-Vector Multiply [R].Los Alamos:CS 267 Final Report,2015. [5]JAMES D,HONG D N.Fast reproducible floating-point summation [C]//21st IEEE Symposium on Computer Arithmetic.2013:163-172. [6]JAMES D,HONG D N.Parallel Reproducible Summation [J].IEEE Transactions on Computer,2015,64(7):2060-2070. [7]ROMAN L,SYLVAIN C,DAVID D,et al.ExBLAS:Reproducible and Accurate BLAS library[R].Austin,TX:NRE2015 at SC15,2015. [8]ROMAN I,SYLVAIN C,DAVID D,et al.Reproducible triangular solvers for high-performance computing[C]//17th International Conference on Information Technology-New Generations(ITNG 2020).Las Vegas:Springer,2020:353-358. [9]SYLVAIN C,DAVID D,STEF G,et al.Numerical Reprodu-cibility for the Parallel Reduction on Multi-and Many-Core Architectures[J].Parallel Computing,2015,49(4):83-97. [10]KATSUHISA O,TAKESHI O,SIEGFRIED M P,et al.Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications [J].Numerical Algorithms,2012,59(1):95-118. [11]RUMP S,TAKESHI O,SHIN'ICHI O.Accurate Floating-Point Summation Part II:Sign,K-Fold Faithful and Rounding to Nearest [J].SIAM Journal on Scientific Computing,2008,31(1):1269-1302. [12]DAICHI M,TAKESHI O,KATSUHISA O.Accurate and Reproducible BLAS Routines with Ozaki Scheme for Many-core Architectures[C]//Parallel Processing and Applied Mathema-tics.Lublin:Springer,2019:516-527. [13]ANSI/IEEE standard 754-1985,IEEE Standard for BinaryFloating-point Arithmetic[S].New York:Institute of Electrical and Electronics Engineers,Inc,1985. [14]KNUTH D E.The Art of Computer Programming:Seminumerical Algorithms(3rd edition) [M].USA:Addison-Wesly,Rea-dingMA,1998. [15]KAHAN W.Pracniques:further remarks on reducing truncation errors [J].Communications of the ACM,1965,8(1):40-42. [16]DEKKER T J.A floating-point technique for extending the avai-lable precision [J].Numerische Mathematik 1971,18(3):224-242. [17]TAKESHI O,RUMP S,SHIN'ICHI O.Accurate sum and dot product [J].SIAM Journal on Scientific Computing,2005,26(6):1955-1988. [18]ZHANG X,WANG Q,ZHANG Y Q.Model-driven level 3BLAS Performance Optimization on Loongson 3A Processor[C]//2012 IEEE 18th International Conference on Parallel and Distributed Systems(ICPADS).Singapore:IEEE Computer Society,2012:684-691. [19]DAVID R B.Programming with POSIX Threads [M].United States:Addision-Wesley Professional,1997:33-36. [20]ZHU Y K,HAYES W B.Correct Rounding and hybrid approach to exact floating-point summation [J].SIAM Journal on Scientific Computing,2009,31(4):2981-3001. |
[1] | 胡艳羽, 赵龙, 董祥军. 一种用于癌症分类的两阶段深度特征选择提取算法 Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification 计算机科学, 2022, 49(7): 73-78. https://doi.org/10.11896/jsjkx.210500092 |
[2] | 吴功兴, 孙兆洋, 琚春华. 考虑中断风险与模糊定价的闭环供应链网络设计模型 Closed-loop Supply Chain Network Design Model Considering Interruption Risk and Fuzzy Pricing 计算机科学, 2022, 49(7): 220-225. https://doi.org/10.11896/jsjkx.201100084 |
[3] | 傅思清, 黎铁军, 张建民. 面向粒子输运程序加速的体系结构设计 Architecture Design for Particle Transport Code Acceleration 计算机科学, 2022, 49(6): 81-88. https://doi.org/10.11896/jsjkx.210600179 |
[4] | 郑智捷. 消解逻辑悖论建立元知识智能化体系 Meta Knowledge Intelligent Systems on Resolving Logic Paradoxes 计算机科学, 2022, 49(1): 9-16. https://doi.org/10.11896/jsjkx.210700023 |
[5] | 刘炜, 阮敏捷, 佘维, 张志鸿, 田钊. 面向物联网的PBFT优化共识算法 PBFT Optimized Consensus Algorithm for Internet of Things 计算机科学, 2021, 48(11): 151-158. https://doi.org/10.11896/jsjkx.210500038 |
[6] | 崔国楠, 王立松, 康介祥, 高忠杰, 王辉, 尹伟. 结合多目标优化算法的模糊聚类有效性指标及应用 Fuzzy Clustering Validity Index Combined with Multi-objective Optimization Algorithm and Its Application 计算机科学, 2021, 48(10): 197-203. https://doi.org/10.11896/jsjkx.200900061 |
[7] | 傅天豪, 田鸿运, 金煜阳, 杨章, 翟季冬, 武林平, 徐小文. 一种面向构件化并行应用程序的性能骨架分析方法 Performance Skeleton Analysis Method Towards Component-based Parallel Applications 计算机科学, 2021, 48(6): 1-9. https://doi.org/10.11896/jsjkx.201200115 |
[8] | 郭彪, 唐麒, 文智敏, 傅娟, 王玲, 魏急波. 一种面向动态部分可重构片上系统的列表式软硬件划分算法 List-based Software and Hardware Partitioning Algorithm for Dynamic Partial Reconfigurable System-on-Chip 计算机科学, 2021, 48(6): 19-25. https://doi.org/10.11896/jsjkx.200700198 |
[9] | 俞建业, 戚湧, 王宝茁. 基于Spark的车联网分布式组合深度学习入侵检测方法 Distributed Combination Deep Learning Intrusion Detection Method for Internet of Vehicles Based on Spark 计算机科学, 2021, 48(6A): 518-523. https://doi.org/10.11896/jsjkx.200700129 |
[10] | 张航, 唐聃, 蔡红亮. 分布式存储系统中的预测式纠删码研究 Study on Predictive Erasure Codes in Distributed Storage System 计算机科学, 2021, 48(5): 130-139. https://doi.org/10.11896/jsjkx.200300124 |
[11] | 鄂海红, 张田宇, 宋美娜. 基于Web的数据可视化图表渲染优化方法 Web-based Data Visualization Chart Rendering Optimization Method 计算机科学, 2021, 48(3): 119-123. https://doi.org/10.11896/jsjkx.200600038 |
[12] | 王妍, 韩笑, 曾辉, 刘荆欣, 夏长清. 边缘计算环境下服务质量可信的任务迁移节点选择 Task Migration Node Selection with Reliable Service Quality in Edge Computing Environment 计算机科学, 2020, 47(10): 240-246. https://doi.org/10.11896/jsjkx.190900054 |
[13] | 王喆, 唐麒, 王玲, 魏急波. 一种基于模拟退火的动态部分可重构系统划分-调度联合优化算法 Joint Optimization Algorithm for Partition-Scheduling of Dynamic Partial Reconfigurable Systems Based on Simulated Annealing 计算机科学, 2020, 47(8): 26-31. https://doi.org/10.11896/jsjkx.200500110 |
[14] | 王国澎, 杨剑新, 尹飞, 蒋生健. 负载均衡的处理器运算资源分配方法 Computing Resources Allocation with Load Balance in Modern Processor 计算机科学, 2020, 47(8): 41-48. https://doi.org/10.11896/jsjkx.191000148 |
[15] | 庄奕, 杨家海. 限时点到多点跨数据中心传输的多源树调度算法 Multi-source Tree-based Scheduling Algorithm for Deadline-aware P2MP Inter-datacenter Transfers 计算机科学, 2020, 47(7): 213-219. https://doi.org/10.11896/jsjkx.200300069 |
|