Computer Science, 2025, Vol. 52, Issue (5): 41-49. doi: 10.11896/jsjkx.241200053
LIAO Qiucheng1, ZHOU Yang2, LIN Xinhua1
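Abstract: Instrumenting parallel programs with timers is a common way to measure and analyze performance on multicore processors. However, the accuracy of high-precision parallel timing is affected by the timing method, hardware configuration, and runtime environment, so measurements are unstable and performance-analysis conclusions are hard to reproduce. In recent years, the core counts of high-performance multicore processors have kept rising, which poses a greater challenge to the accuracy of multicore parallel timing. In real computational programs, high-precision parallel timing currently faces two problems: 1) the accuracy of different timing functions cannot be compared quantitatively; 2) the deviation magnitude of microsecond- and millisecond-level parallel timing distributions under multiple influencing factors cannot be analyzed quantitatively. To address these problems, we first design a metric for quantitatively evaluating the deviation of the statistical distribution of timing results, and develop ParTES, a multicore timing-deviation evaluation tool that supports the X86 and Armv8 instruction sets. ParTES can emulate the cache characteristics and timing intervals of real computing scenarios and quantitatively evaluate the measurement deviation of different timing functions. Second, we carry out a quantitative analysis of the stability of microsecond- and millisecond-level parallel timing on Kunpeng, Phytium, and Hygon high-performance processors. The experimental results show that the timing method, cache hit rate, instructions adjacent to the timing function, and server hardware configuration all affect the accuracy of parallel timing results. On Kunpeng, Phytium, and Hygon processors, the timing methods with the smallest deviation and the most stable variation in deviation magnitude are, respectively, the PAPI timing function, the POSIX clock_gettime timing function, and the RDTSC assembly timing instruction of the C86 instruction set.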
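To make the idea of short-interval timing variability concrete, the following is a minimal sketch and not the ParTES tool or the paper's metric. It repeatedly times a small fixed workload with the POSIX clock_gettime clock (through Python's `time.clock_gettime_ns`, falling back to `time.perf_counter_ns` where that call is unavailable) and summarizes how widely the samples spread. The names `read_clock_ns`, `workload`, `sample_intervals`, and `spread_summary`, and the spread ratio itself, are illustrative assumptions.

```python
import statistics
import time


def read_clock_ns():
    """Read a monotonic clock in nanoseconds.

    Uses POSIX clock_gettime(CLOCK_MONOTONIC) where Python exposes it,
    otherwise falls back to perf_counter_ns (assumption for portability).
    """
    if hasattr(time, "clock_gettime_ns"):
        return time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    return time.perf_counter_ns()


def workload(n):
    """Hypothetical fixed workload standing in for a timed code region."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sample_intervals(repeats, n):
    """Time the same workload `repeats` times and return elapsed ns per run."""
    samples = []
    for _ in range(repeats):
        start = read_clock_ns()
        workload(n)
        end = read_clock_ns()
        samples.append(end - start)
    return samples


def spread_summary(samples):
    """Summarize the spread of timing samples.

    rel_spread = (p95 - min) / median is an illustrative ratio only,
    not the deviation metric defined in the paper.
    """
    ordered = sorted(samples)
    median = statistics.median(ordered)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    rel_spread = (p95 - ordered[0]) / median if median else 0.0
    return {
        "min_ns": ordered[0],
        "median_ns": median,
        "p95_ns": p95,
        "rel_spread": rel_spread,
    }


if __name__ == "__main__":
    print(spread_summary(sample_intervals(repeats=1000, n=2000)))
```

Running the same script with a different clock source in `read_clock_ns`, or with different values of `n`, gives a rough feel for how the timing method and interval length change the spread. A controlled comparison of the kind described in the abstract would also need to pin threads to cores and control cache state, which this sketch does not do.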