计算机科学 ›› 2025, Vol. 52 ›› Issue (5): 67-75.doi: 10.11896/jsjkx.241100175

• 高性能计算 • 上一篇    下一篇

TS3:能效优先的特定起点分类最优线程数搜索

马兆阳, 陈娟, 周一畅, 吴贤瑜, 高鹏飞, 阮文浩, 詹昊明   

  1. 国防科技大学计算机学院 长沙 410073
  • 收稿日期:2024-11-28 修回日期:2025-02-27 出版日期:2025-05-15 发布日期:2025-05-12
  • 通讯作者: 陈娟(juanchen@nudt.edu.cn)
  • 作者简介:(maozhaoyang@nudt.edu.cn)
  • 基金资助:
    并行与分布计算全国重点实验室基金(2023-KJWPDL-01)

TS3:Energy-Efficiency-First Optimal Thread Number Search Algorithm Based on Specific Starting Point Classification

MA Zhaoyang, CHEN Juan, ZHOU Yichang, WU Xianyu, GAO Pengfei, RUAN Wenhao, ZHAN Haoming   

  1. College of Computer Science and Technology,National University of Defense Technology,Changsha 410073,China
  • Received:2024-11-28 Revised:2025-02-27 Online:2025-05-15 Published:2025-05-12
  • About author:MA Zhaoyang,born in 2004,undergra-duate.His main research interests include high performance computing and so on.
    CHEN Juan,born in 1980,Ph.D,professor.Her main research interests include high performance computing,low-power compiler and power management,etc.
  • Supported by:
    Parallel and Distributed Computing National Key Laboratory Fund(2023-KJWPDL-01).

摘要: 最优线程数设置是影响多线程程序性能和功耗的关键之一。然而,目前寻找最优线程数的算法通常是从单一固定起点开始搜索,往往会造成搜索精度低、搜索开销大的问题。最优线程数的分布和位置与多种因素有关,包括程序所属类型、优化目标(性能、功耗和 EDP(Energy-delay Product))、并行的多线程区域、软硬件配置参数等。围绕能效优先的最优线程数搜索问题,提出了能效优先的特定起点分类最优线程数搜索算法( Energy-Efficiency-First Optimal Thread Number Search Algorithm based on Specific Starting Point Classification,简称TS3方法)”,通过设计基于程序分类的特殊起点设定方法来确定搜索起点,并采用启发式算法和二分查找方法搜索最优线程数,提升搜索效率,有效提升了能效优先目标(性能最优、功耗最优、能效 EDP 最优)下的最优线程数搜索精度并降低了搜索开销。在两个 x86 和一个 ARM 平台上用 8 个 benchmark 对算法有效性进行了详细实验验证,结果表明,与 Baseline 相比,TS3方法的性能平均提升 0.29%(平台 A)、0.17%(平台 B)、10.77%(平台 C);功耗平均降低 2.35%(平台 A)、1.87%(平台 B)、15.97%(平台 C);EDP 平均降低 6.36%(平台 A)、5.07%(平台 B)、46.94%(平台 C)。在3个平台上,与目前经典搜索方法相比,TS3方法的性能平均提升 10.16%,功耗平均降低13.45%,EDP 平均降低 23.77%;搜索开销平均降低 86.8%。

关键词: 多线程程序, 能效优化, 最优线程数, 随机森林算法, 启发式算法

Abstract: Optimal thread number setting is one of the key factors affecting the performance and power consumption of multi-threaded programs.However,current algorithms for finding the optimal number of threads usually start the search from a single fixed point,which cause the problem of low precision and large search overhead.The distribution and location of the optimal number of threads are related to various,factors,including types of programs,optimization objectives(performance,power consumption,and EDP),parallel multi-threaded areas,and software-hardware configuration parameters.This paper focuses on the problem of searching for the optimal number of threads with an emphasis on energy efficiency and proposes an energy-efficiency-first optimal thread number search algorithm based on specific starting point classification(abbreviated as TS3 method).By designing a multi-threaded program classifier to optimize the setting of search starting points,and applying heuristic and binary search algorithms to enhance search efficiency,the method effectively improves the accuracy of the optimal number of threads search under energy efficiency priorities(optimal performance,optimal powerconsumption,optimal EDP) and reduces search costs.The effectiveness of the algorithm is experimentally validated using eight benchmarks on two x86 platforms and one ARM platform.Compared to the baseline,the TS3method achieves an average performance improvement of 0.29%(Platform A),0.17%(Platform B),and 10.77%(Platform C);average power consumption reduction of 2.35%(Platform A),1.87%(PlatformB),and 15.97%(Platform C);and average EDP reduction of 6.36%(Platform A),5.07%(Platform B),and 46.94%(Platform C).Across the three platforms,compared to current classical search methods,the TS3 method demonstrates an average performance improvement of 10.16%,an average reduction in power consumption of 13.45%,and an average reduction in EDP of 23.77%,the search overhead is reduced by 86.8%.

Key words: Multithreaded program, Energy efficiency optimization, Optimal number of threads, Random forest algorithm, Heuristic algorithm

中图分类号: 

  • TP302
[1]WANG R,LU K,CHEN J,et al.Brief introduction of tianhe exascaleprototype system[J].Tsinghua Science and Technology,2021,26(3):361-369.
[2]DUTOT P F,GEORGIOU Y,GLESSER D,et al.Towardsenergy budget control in hpc[C]//2017 17th IEEE/ACM International Symposium on Cluster,Cloud and Grid Computing(CCGRID).2017:381-390.
[3]LEE J,PARK J H,KIM H,et al.Adaptive execution techniques of parallel programs for multiprocessors[J].Journal of Parallel and Distributed Computing,2010,70(5):467-480.
[4]SULEMAN M A,QURESHI M K,PATT Y N.Feedback-driven threading:power-efficient and high-performance execution of multi-threadedworkloads on cmps[C]//Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems.New York:ACM,2008:277-286.
[5]SUBRAMANIAN L,SESHADRI V,KIM Y,et al.Mise:Providing performance predictability and improving fairness in shared main memory systems[C]//2013 IEEE 19th International Symposium on High Performance Computer Architecture(HPCA).2013:639-650.
[6]JOAO J A,SULEMAN M A,MUTLU O,et al.Bottleneck identification and scheduling in multithreaded applications[J].ACM SIGARCH Computer Architecture News,2012,40(1):223-234.
[7]CHEN J,ZHOU W,DONG Y,et al.Analyzingtime-dimensioncommunication characterizations for representative scientific applications onsupercomputer systems[J].Frontiers of Computer Science,2019,13(6):1228-1242.
[8]NASA.Nasparallel benchmarks[EB/OL].https://www.nas.nasa.gov/software/npb.html.
[9]MOORI M K,ROCHA H M G D A,LORENZON A F,et al.Searching for the ideal number of threads on asymmetric multiprocessors[C]//2023 XIII Brazilian Symposium on Computing Systems Engineering(SBESC).2023:1-6.
[10]MALAVE S H,SHINDE S K.Adaptive manta ray foraging optimizer fordetermining optimal thread count on many-core architecture[C]//Congress on Intelligent Systems.Singapore:Springer,2023:209-222.
[11]MARR D T,BINNS F,HILL D L,et al.Hyper-ThreadingTechnology in Microprocessor[J].Intel Technology Journal,2014,2(10):1-12.
[12]JUE R,HENG L.Fast and accurate long-read assembly with wtdbg2[J].Nature Methods,2019,17(2):155-158.
[13]LORENZON A F,DE OLIVEIRA C C,SOUZA J D,et al.Aurora:Seamless Optimization of OpenMP Applications[J]IEEE Transactions on Parallel and Distributed Systems,2019,3(5):1007-1021.
[14]PUSUKURI K K,GUPTA R,BHUYAN L N.Thread reinforcer:Dynamically determining number of threads via os level monitoring[C]//2011 IEEE International Symposium on Workload Characterization(IISWC).2011:116-125.
[15]LEE J,WU H,RAVICHANDRAN M,et al.Thread tailor:dynamically weaving threads together for efficient,adaptive parallel applications[J].ACM SIGARCH Computer Architecture News,2010,38(3):270-279.
[16]MOORE R W,CHILDERS B R.Using utilityprediction models to dynamically choose program thread counts[C]//2012 IEEE International Symposium on Performance Analysis of Systems Software.2012:135-144.
[17]DANCHEVA T,GUSEV M,ZDRAVEVSKI V,et al.AnOpenMP runtime profiler/configuration tool for dynamic optimization of the number of threads[C]//2016 39th International Convention on Information and Communication Technology,Electronics and Microelectronics(MIPRO).IEEE,2016:192-197.
[18]SCHWARZROCK J,LORENZON A F,NAVAUX P O,et al.Potentialgains in EDP by dynamically adapting the number of threads for OpenMP applications in embedded systems[C]//2017 VII Brazilian Symposiumon Computing Systems Enginee-ring(SBESC).IEEE,2017:79-85.
[19]PESTEL S D,STEEN S V D,AKRAM S,et al.RPPM:Rapid performanceprediction of multithreaded workloads on multicore processors[C]//2019 IEEE International Symposium on Performance Analysis of Systems and Software.2019.
[20]WANG W,DAVIDSON J W,SOFFA M L.Predicting the me-mory bandwidth and optimal core allocations for multi-threaded applications onlarge-scale numa machines[C]//2016 IEEE International Symposiumon High Performance Computer Architecture(HPCA).2016:419-431.
[21]CURTIS-MAURY M,DING X,ANTONOPOULOS C D,et al.An evaluation of OpenMP on current and emerging multithreaded/multicore processors[C]LNTCS.2008:133-144.
[22]NIEPLOCHA J,MÁRQUEZ A,FEO J,et al.Evaluating the po-tential of multithreaded platforms for irregular scientific computations[C]//Proceedings of the 4th International Conference on Computing Frontiers.2007:47-58.
[23]MOSELEY T.Methods for modeling resource contention on simultaneousmultithreading processors[C]//2005 International Conference on Computer Design.2005.
[24]SANJUAN-ESTRADAJF,CASADOLG,GARCÍAI.Adaptive pa-rallel interval branch and bound algorithms based on their performance for multicore architectures[J].the Journal of Supercomputing,2011,58:376-384.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!