Computer Science ›› 2025, Vol. 52 ›› Issue (5): 41-49. doi: 10.11896/jsjkx.241200053

• High Performance Computing •

Metrics and Tools for Evaluating the Deviation in Parallel Timing

LIAO Qiucheng1, ZHOU Yang2, LIN Xinhua1   

  1 Center for High-Performance Computing,Shanghai Jiao Tong University,Shanghai 200240,China
    2 Science and Technology Department of Zhejiang Province,Hangzhou 310006,China
  • Received:2024-12-09 Revised:2025-02-18 Online:2025-05-15 Published:2025-05-12
  • About author:
    LIAO Qiucheng,born in 1994,engineer,is a member of CCF(No.P6171M).His main research interests include high-performance computing.
    LIN Xinhua,born in 1979,Ph.D,senior engineer,Ph.D supervisor,is a distinguished member of CCF(No.23737D).His main research interests include high-performance computing.
  • Supported by:
    National Natural Science Foundation of China(62072300).

Abstract: In parallel computing, instrumenting specific code segments is a common approach to performance evaluation on multicore processors. However, factors such as the timing method, hardware configuration, and runtime environment affect the accuracy of parallel timing, jeopardizing the stability and reproducibility of performance measurements. As the core counts of multicore processors grow, accurate parallel timing has become more challenging. Two key problems remain: 1) current methods cannot quantitatively compare the accuracy of different timing methods; 2) the root causes of parallel timing variability are not fully understood. This paper proposes metrics for evaluating the deviation in measurements and presents ParTES, a tool that emulates realistic cache conditions and timing intervals on x86 and Armv8 CPUs, allowing quantitative evaluation of timing variability across different timing methods. This study performed microsecond-level and millisecond-level analyses of parallel timing deviations on Kunpeng, Phytium, and Hygon processors. The results show that the timing method, cache status, nearby instructions, and server hardware configuration all influence timing accuracy. Among these CPUs, the most stable timing methods are PAPI on Kunpeng, POSIX's clock_gettime on Phytium, and the RDTSC instruction on Hygon.
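The abstract compares timing mechanisms such as the RDTSC cycle counter, POSIX clock_gettime, and the PAPI library. The sketch below is not the paper's ParTES tool or its proposed deviation metrics; it is a minimal illustration, with a hypothetical kernel() workload standing in for the instrumented code segment and a simple min/max spread standing in for a deviation metric, of how repeating a short measurement with clock_gettime and (on x86) RDTSC exposes run-to-run timing variability.

```c
/* A minimal sketch, not the paper's ParTES tool: time one code segment
 * repeatedly with POSIX clock_gettime and (on x86) the RDTSC instruction,
 * then report the min/max spread as a crude indicator of timing deviation.
 * kernel() is a hypothetical stand-in for the instrumented code segment. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#ifdef __x86_64__
#include <x86intrin.h>   /* __rdtsc() */
#endif

#define REPS 1000

static volatile double sink;   /* keeps the workload from being optimized away */

static void kernel(void) {     /* hypothetical short workload */
    double s = 0.0;
    for (int i = 0; i < 10000; i++) s += (double)i * 1e-6;
    sink = s;
}

static double now_ns(void) {   /* wall-clock time in nanoseconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

static void report(const char *label, const double *v, int n) {
    double mn = v[0], mx = v[0];
    for (int i = 1; i < n; i++) {
        if (v[i] < mn) mn = v[i];
        if (v[i] > mx) mx = v[i];
    }
    printf("%-16s min %.0f  max %.0f  spread %.1f%%\n",
           label, mn, mx, 100.0 * (mx - mn) / mn);
}

int main(void) {
    static double t_ns[REPS];
#ifdef __x86_64__
    static double t_cyc[REPS];
#endif
    for (int i = 0; i < REPS; i++) {
        double t0 = now_ns();
#ifdef __x86_64__
        uint64_t c0 = __rdtsc();
#endif
        kernel();
#ifdef __x86_64__
        t_cyc[i] = (double)(__rdtsc() - c0);
#endif
        t_ns[i] = now_ns() - t0;
    }
    report("clock_gettime/ns", t_ns, REPS);
#ifdef __x86_64__
    report("rdtsc/cycles", t_cyc, REPS);
#endif
    return 0;
}
```

On Armv8 processors such as Kunpeng and Phytium the RDTSC branch compiles out and only the portable clock_gettime path remains; a PAPI-based variant would wrap the same repetition loop with PAPI's timer calls instead.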

Key words: High performance computing, Parallel computing, Performance evaluation, Performance analysis, Error analysis

CLC Number: TP302