Computer Science ›› 2025, Vol. 52 ›› Issue (5): 41-49. doi: 10.11896/jsjkx.241200053

• High-Performance Computing •

Metrics and Tools for Evaluating the Deviation in Parallel Timing

LIAO Qiucheng1, ZHOU Yang2, LIN Xinhua1

  1 Center for High-Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
    2 Science and Technology Department of Zhejiang Province, Hangzhou 310006, China
  • Received: 2024-12-09 Revised: 2025-02-18 Online: 2025-05-15 Published: 2025-05-12
  • Corresponding author: LIN Xinhua (james@sjtu.edu.cn)
  • About author: (keyliao@sjtu.edu.cn)
  • Supported by:
    National Natural Science Foundation of China (62072300)

Metrics and Tools for Evaluating the Deviation in Parallel Timing

LIAO Qiucheng1, ZHOU Yang2, LIN Xinhua1   

  1 Center for High-Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
    2 Science and Technology Department of Zhejiang Province, Hangzhou 310006, China
  • Received:2024-12-09 Revised:2025-02-18 Online:2025-05-15 Published:2025-05-12
  • About author:
    LIAO Qiucheng, born in 1994, engineer, is a member of CCF (No. P6171M). His main research interests include high-performance computing.
    LIN Xinhua, born in 1979, Ph.D, senior engineer, Ph.D supervisor, is a distinguished member of CCF (No. 23737D). His main research interests include high-performance computing.
  • Supported by:
    National Natural Science Foundation of China(62072300).

Abstract: Instrumented timing in parallel programs is a common means of performance measurement and analysis on multicore processors. However, the accuracy of high-resolution parallel timing is affected by the timing method, hardware configuration, and runtime environment; measurements are unstable, and performance-analysis conclusions are hard to reproduce. In recent years, the core counts of high-performance multicore processors have kept climbing, posing an even greater challenge to the accuracy of multicore parallel timing. At present, high-resolution parallel timing in real applications faces two problems: 1) the accuracy of different timing functions cannot be compared quantitatively; 2) the magnitude of deviation in microsecond- and millisecond-scale parallel timing distributions under multiple influencing factors cannot be analyzed quantitatively. To address these problems, this paper first designs metrics for quantitatively evaluating the statistical-distribution deviation of timing results and develops ParTES, a multicore timing-deviation evaluation tool supporting the X86 and Armv8 instruction sets. ParTES can emulate the cache characteristics and timing intervals of real computing scenarios to quantitatively evaluate the measurement deviation of different timing functions. Second, quantitative analyses of microsecond- and millisecond-scale parallel timing stability are carried out on Kunpeng, Phytium, and Hygon high-performance processors. Experimental results show that the timing method, cache hit rate, instructions adjacent to the timing function, and server hardware configuration all affect the accuracy of parallel timing results. On the Kunpeng, Phytium, and Hygon processors, the timing methods with the smallest deviation and the most stable deviation magnitude are, respectively, PAPI's timing functions, POSIX's clock_gettime, and the RDTSC assembly timing instruction of the C86 instruction set.
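The abstract does not spell out which statistical-distribution deviation metric is used. As a hedged sketch only: one common choice for comparing two one-dimensional timing samples is the 1-Wasserstein distance, which for equal-sized samples reduces to the mean absolute difference of the sorted values. The function name and sample values below are illustrative assumptions, not the paper's definition.

```c
#include <stdlib.h>
#include <math.h>

/* qsort comparator for doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* 1-Wasserstein distance between two equal-sized 1-D samples:
 * sort both samples in place, then average the pointwise absolute
 * differences of the order statistics. */
double wasserstein1(double *a, double *b, int n) {
    qsort(a, n, sizeof(double), cmp_double);
    qsort(b, n, sizeof(double), cmp_double);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += fabs(a[i] - b[i]);
    return sum / n;
}
```

For two hypothetical timing samples run1 = {10.1, 10.3, 10.2, 10.4} and run2 = {10.2, 10.6, 10.3, 10.5} microseconds, this yields 0.15 µs; identical distributions yield 0, so larger values indicate larger distribution deviation between runs.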

Key words: High performance computing, Parallel computing, Performance evaluation, Performance analysis, Error analysis

Abstract: In parallel computing, instrumenting specific code segments is a common means of performance measurement on multicore processors. However, factors such as the timing method, hardware configuration, and runtime environment affect parallel timing accuracy, jeopardizing the stability and reproducibility of performance measurements. As the core counts of multicore processors grow, accurate parallel timing has become more challenging. Two key problems remain: 1) current methods cannot quantitatively compare the accuracy of different timing functions; 2) the magnitude of microsecond- and millisecond-scale parallel timing deviations under multiple influencing factors cannot be analyzed quantitatively. This paper proposes metrics for evaluating the deviation in timing measurements and presents ParTES, a tool that emulates realistic cache conditions and timing intervals on X86 and Armv8 CPUs, allowing quantitative evaluation of timing variability across different timing methods. This study performs microsecond- and millisecond-level analyses of parallel timing deviations on Kunpeng, Phytium, and Hygon processors. The results show that the timing method, cache status, instructions adjacent to the timing call, and server hardware configuration all influence timing accuracy. Among these CPUs, the most stable timing methods are PAPI's timing functions on Kunpeng, POSIX's clock_gettime on Phytium, and the RDTSC instruction on Hygon.

Key words: High performance computing, Parallel computing, Performance evaluation, Performance analysis, Error analysis

CLC number: TP302