计算机科学 ›› 2025, Vol. 52 ›› Issue (5): 67-75.doi: 10.11896/jsjkx.241100175
马兆阳, 陈娟, 周一畅, 吴贤瑜, 高鹏飞, 阮文浩, 詹昊明
MA Zhaoyang, CHEN Juan, ZHOU Yichang, WU Xianyu, GAO Pengfei, RUAN Wenhao, ZHAN Haoming
摘要: 最优线程数设置是影响多线程程序性能和功耗的关键之一。然而,目前寻找最优线程数的算法通常是从单一固定起点开始搜索,往往会造成搜索精度低、搜索开销大的问题。最优线程数的分布和位置与多种因素有关,包括程序所属类型、优化目标(性能、功耗和 EDP(Energy-delay Product))、并行的多线程区域、软硬件配置参数等。围绕能效优先的最优线程数搜索问题,提出了能效优先的特定起点分类最优线程数搜索算法( Energy-Efficiency-First Optimal Thread Number Search Algorithm based on Specific Starting Point Classification,简称TS3方法)”,通过设计基于程序分类的特殊起点设定方法来确定搜索起点,并采用启发式算法和二分查找方法搜索最优线程数,提升搜索效率,有效提升了能效优先目标(性能最优、功耗最优、能效 EDP 最优)下的最优线程数搜索精度并降低了搜索开销。在两个 x86 和一个 ARM 平台上用 8 个 benchmark 对算法有效性进行了详细实验验证,结果表明,与 Baseline 相比,TS3方法的性能平均提升 0.29%(平台 A)、0.17%(平台 B)、10.77%(平台 C);功耗平均降低 2.35%(平台 A)、1.87%(平台 B)、15.97%(平台 C);EDP 平均降低 6.36%(平台 A)、5.07%(平台 B)、46.94%(平台 C)。在3个平台上,与目前经典搜索方法相比,TS3方法的性能平均提升 10.16%,功耗平均降低13.45%,EDP 平均降低 23.77%;搜索开销平均降低 86.8%。
中图分类号:
[1]WANG R,LU K,CHEN J,et al.Brief introduction of tianhe exascaleprototype system[J].Tsinghua Science and Technology,2021,26(3):361-369. [2]DUTOT P F,GEORGIOU Y,GLESSER D,et al.Towardsenergy budget control in hpc[C]//2017 17th IEEE/ACM International Symposium on Cluster,Cloud and Grid Computing(CCGRID).2017:381-390. [3]LEE J,PARK J H,KIM H,et al.Adaptive execution techniques of parallel programs for multiprocessors[J].Journal of Parallel and Distributed Computing,2010,70(5):467-480. [4]SULEMAN M A,QURESHI M K,PATT Y N.Feedback-driven threading:power-efficient and high-performance execution of multi-threadedworkloads on cmps[C]//Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems.New York:ACM,2008:277-286. [5]SUBRAMANIAN L,SESHADRI V,KIM Y,et al.Mise:Providing performance predictability and improving fairness in shared main memory systems[C]//2013 IEEE 19th International Symposium on High Performance Computer Architecture(HPCA).2013:639-650. [6]JOAO J A,SULEMAN M A,MUTLU O,et al.Bottleneck identification and scheduling in multithreaded applications[J].ACM SIGARCH Computer Architecture News,2012,40(1):223-234. [7]CHEN J,ZHOU W,DONG Y,et al.Analyzingtime-dimensioncommunication characterizations for representative scientific applications onsupercomputer systems[J].Frontiers of Computer Science,2019,13(6):1228-1242. [8]NASA.Nasparallel benchmarks[EB/OL].https://www.nas.nasa.gov/software/npb.html. [9]MOORI M K,ROCHA H M G D A,LORENZON A F,et al.Searching for the ideal number of threads on asymmetric multiprocessors[C]//2023 XIII Brazilian Symposium on Computing Systems Engineering(SBESC).2023:1-6. [10]MALAVE S H,SHINDE S K.Adaptive manta ray foraging optimizer fordetermining optimal thread count on many-core architecture[C]//Congress on Intelligent Systems.Singapore:Springer,2023:209-222. [11]MARR D T,BINNS F,HILL D L,et al.Hyper-ThreadingTechnology in Microprocessor[J].Intel Technology Journal,2014,2(10):1-12. [12]JUE R,HENG L.Fast and accurate long-read assembly with wtdbg2[J].Nature Methods,2019,17(2):155-158. [13]LORENZON A F,DE OLIVEIRA C C,SOUZA J D,et al.Aurora:Seamless Optimization of OpenMP Applications[J]IEEE Transactions on Parallel and Distributed Systems,2019,3(5):1007-1021. [14]PUSUKURI K K,GUPTA R,BHUYAN L N.Thread reinforcer:Dynamically determining number of threads via os level monitoring[C]//2011 IEEE International Symposium on Workload Characterization(IISWC).2011:116-125. [15]LEE J,WU H,RAVICHANDRAN M,et al.Thread tailor:dynamically weaving threads together for efficient,adaptive parallel applications[J].ACM SIGARCH Computer Architecture News,2010,38(3):270-279. [16]MOORE R W,CHILDERS B R.Using utilityprediction models to dynamically choose program thread counts[C]//2012 IEEE International Symposium on Performance Analysis of Systems Software.2012:135-144. [17]DANCHEVA T,GUSEV M,ZDRAVEVSKI V,et al.AnOpenMP runtime profiler/configuration tool for dynamic optimization of the number of threads[C]//2016 39th International Convention on Information and Communication Technology,Electronics and Microelectronics(MIPRO).IEEE,2016:192-197. [18]SCHWARZROCK J,LORENZON A F,NAVAUX P O,et al.Potentialgains in EDP by dynamically adapting the number of threads for OpenMP applications in embedded systems[C]//2017 VII Brazilian Symposiumon Computing Systems Enginee-ring(SBESC).IEEE,2017:79-85. [19]PESTEL S D,STEEN S V D,AKRAM S,et al.RPPM:Rapid performanceprediction of multithreaded workloads on multicore processors[C]//2019 IEEE International Symposium on Performance Analysis of Systems and Software.2019. [20]WANG W,DAVIDSON J W,SOFFA M L.Predicting the me-mory bandwidth and optimal core allocations for multi-threaded applications onlarge-scale numa machines[C]//2016 IEEE International Symposiumon High Performance Computer Architecture(HPCA).2016:419-431. [21]CURTIS-MAURY M,DING X,ANTONOPOULOS C D,et al.An evaluation of OpenMP on current and emerging multithreaded/multicore processors[C]LNTCS.2008:133-144. [22]NIEPLOCHA J,MÁRQUEZ A,FEO J,et al.Evaluating the po-tential of multithreaded platforms for irregular scientific computations[C]//Proceedings of the 4th International Conference on Computing Frontiers.2007:47-58. [23]MOSELEY T.Methods for modeling resource contention on simultaneousmultithreading processors[C]//2005 International Conference on Computer Design.2005. [24]SANJUAN-ESTRADAJF,CASADOLG,GARCÍAI.Adaptive pa-rallel interval branch and bound algorithms based on their performance for multicore architectures[J].the Journal of Supercomputing,2011,58:376-384. |
|