计算机科学 ›› 2022, Vol. 49 ›› Issue (10): 18-26.doi: 10.11896/jsjkx.220100147

• 高性能计算* 上一篇    下一篇

基于神经网络模型的stencil循环最优分块大小预测

包怡坤1,2, 张鹏2,3, 徐小文2,3, 莫则尧3,4   

  1. 1 中国工程物理研究院研究生院 北京 100094
    2 北京应用物理与计算数学研究所 北京 100088
    3 中物院高性能数值模拟软件中心 北京 100088
    4 中国工程物理研究院 四川 绵阳 621900
  • 收稿日期:2022-01-15 修回日期:2022-05-06 出版日期:2022-10-15 发布日期:2022-10-13
  • 通讯作者: 张鹏(zhang_peng@iapcm.ac.cn)
  • 作者简介:(baoyikun19@gscaep.ac.cn)
  • 基金资助:
    国家自然科学基金(62032023)

Prediction of Optimal Loop Tiling Size for stencil Computation Based on Neural Network Model

BAO Yi-kun1,2, ZHANG Peng2,3, XU Xiao-wen2,3, MO Ze-yao3,4   

  1. 1 Graduate School of China Academy of Engineering Physics,Beijing 100094,China
    2 Institute of Applied Physics and Computational Mathematics,Beijing 100088,China
    3 CAEP Software Center for High Performance Numerical Simulation,Beijing 100088,China
    4 China Academy of Engineering Physics,Mianyang,Sichuan 621900,China
  • Received:2022-01-15 Revised:2022-05-06 Online:2022-10-15 Published:2022-10-13
  • About author:BAO Yi-kun,born in 1997,postgra-duate.Her main research interests include performance optimization,parallel computing and machine learning.
    ZHANG Peng,born in 1990,associate researcher.His main research interests include parallel computing,high performance mathematics softwares and software performance optimizations on multi- and many-core processors.
  • Supported by:
    National Natural Science Foundation of China(62032023).

摘要: stencil循环是科学与工程计算应用中最主要的计算核心之一。循环分块技术可有效改善stencil循环的数据局部性,提高计算并行度。分块的大小选择对stencil循环的性能影响很大,传统的分块大小选择方法通常在时间开销、人工成本、分块选择精度等方面存在短板,实用性较差。文中提出了一种基于人工神经网络的分块大小选择方法,用于预测三维Jacobi型stencil循环程序的最优分块。对来源于实际数值模拟软件中的11个stencil循环进行最优分块预测,实验结果显示,在单核串行和多核并行两种场景下,程序使用模型预测分块相比不分块的性能提升分别为2%和35%,与网格搜索方法的分块性能相当,但在线预测时间开销仅约为后者的1/30 000。此外,相比基于静态分析的Turbo-tiling方法,预测最优分块的实测性能平均提升了约9%。

关键词: stencil计算, 循环分块技术, 机器学习, 人工神经网络

Abstract: Stencil computation is one kind of the most important loop kernels in scientific and engineering computing applications.Loop tiling can effectively improve the data locality of stencil computation and the degree of computational parallelism,but the best tile size is hard to choose.Traditional tile size selection methods usually have shortcomings in some ways of time overhead,labor cost and model accuracy.In this paper,a tile size selection method based on artificial neural network is proposed to predict the optimal tile size of three-dimensional Jacobi stencil loop programs.Experimental results show that,for 11 real stencil programs,the performance improvement of the programs using the model prediction tile size compared with the non tiling is 2% and 35% in serial and parallel tests respectively.Compared with the well-known grid search method,our method has a similar prediction accuracy,but only takes one 30 thousandth of the online time cost.In addition,compared with the Turbo-tiling method,our method improves the performance of tiled codes nearly 9% in average.

Key words: Stencil computation, Loop tiling technology, Machine learning, Artificial neural network

中图分类号: 

  • TP314
[1]DATTA K.Auto-tuning stencil codes for cache-based multicore platforms[D].Berkeley:University of California,2009.
[2] XU Y T,FU H H,GAN L,et al.Performance Optimization and Analysis for Different Stencil Kernels on Multi-Core and Many-Core Architectures[C]//2013 CCF National Annual Conference on High Performance Computing.2013:628-637.
[3]LIU S,WU W G,ZHAO B,et al.Loop Tiling for Optimization of Locality and Parallelism[J].Journal of Computer Research and Development,2015,52(5):1160-1176.
[4]WOLF M,LAM M.A data Locality Optimizing Algorithm[C]//Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation.1991:30-44.
[5]LAM M S.The cache preformance and optimizations of Blocked Argorithm[J].ACM SIGPLAN Notices,1991(4):63-74.
[6]MITCHELL N,HOGSTEDT K,CARTER L,et al.Quantifying the Multi-Level Nature of Tiling Interactions[J].International Journal of Parallel Programming,1998,26(6):641-670.
[7]SARKAR V,MEGIDDO N.An analytical model for loop tiling and its solution[C]//2000 IEEE International Symposium on Performance Analysis of Systems and Software.ISPASS(Cat.No.00EX422).IEEE,2000:146-153.
[8]HSU C H,KREMER U.A quantitative analysis of tile size selection algorithms[J].The Journal of Supercomputing,2004,27(3):279-294.
[9]LIU J,ZHANG Y,DING W,et al.On-chip cache hierarchy-aware tile scheduling for multicore machines[C]//International Symposium on Code Generation and Optimization(CGO 2011).IEEE,2011:161-170.
[10]MEHTA S,BEERAKA G,YEW P C.Tile size selection revisited[J].ACM Transactions on Architecture and Code Optimization(TACO),2013,10(4):1-27.
[11]MEHTA S,GARG R,TRIVEDI N,et al.Turbotiling:Leveraging prefetching to boost performance of tiled codes[C]//Proceedings of the 2016 International Conference on Supercompu-ting.2016:1-12.
[12]SATO Y,YUKI T,ENDO T.An autotuning framework forscalable execution of tiled code via iterative polyhedral compilation[J].ACM Transactions on Architecture and Code Optimization(TACO),2019,15(4):1-23.
[13]ZHU Y,PANG J M,XU J L,et al.Adaptive Tiling Size Algo-rithm for 3D Stencil Computation on SW26010 Many-core Processor[J].Journal of Computer Science,2021,48(6):10-18.
[14]KISUKI T,KNIJNENBURG P M W,O'BOYLE M F P.Combined selection of tile sizes and unroll factors using iterative compilation[C]//Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques(Cat.No.PR00622).IEEE,2000:237-246.
[15]WHALEY R C,PETITET A,DONGARRA J J.Automated em-pirical optimizations of software and the ATLAS project[J].Parallel Computing,2001,27(1/2):3-35.
[16]KNIJNENBURG P M W,KISUKI T,GALLIVAN K,et al.The effect of cache models on iterative compilation for combined ti-ling and unrolling[J].Concurrency and Computation:Practice and Experience,2004,16(2/3):247-270.
[17]CHEN C,CHAME J,HALL M.Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy[C]//International Symposium on Code Generation and Optimization.IEEE,2005:111-122.
[18]YOTOV K,PINGALI K,STODGHILL P.Think globally,search locally[C]//Proceedings of the 19th Annual Interna-tional Conference on Supercomputing.2005:141-150.
[19]TIWARI A,CHEN C,CHAME J,et al.A scalable auto-tuning framework for compiler optimization[C]//2009 IEEE International Symposium on Parallel & Distributed Processing.IEEE,2009:1-12.
[20]GOTO K,VAN DE GEIJN R.High-performance implementa-tion of the level-3 BLAS[J].ACM Transactions on Mathematical Software(TOMS),2008,35(1):1-14.
[21]SHIRAKO J,SHARMA K,FAUZIA N,et al.Analytical bounds for optimal tile size selection[C]//International Conference on Compiler Construction.Berlin:Springer,2012:101-121.
[22]YUKI T,RENGANARAYANAN L,RAJOPADHYE S,et al.Automatic creation of tile size selection models[C]//Procee-dings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization.2010:190-199.
[23]MALIK A M.Optimal tile size selection problem using machine learning[C]//2012 11th International Conference on Machine Learning and Applications.IEEE,2012:275-280.
[24]ZHANG P,FANG J,YANG C,et al.Optimizing streaming para-llelism on heterogeneous many-core architectures[J].IEEE Transactions on Parallel and Distributed Systems,2020,31(8):1878-1896.
[25]LIU S,CUI Y,JIANG Q,et al.An efficient tile size selection model based on machine learning[J].Journal of Parallel and Distributed Computing,2018,121:27-41.
[26]CHI H Y,CHEN C B.Prediction of Loop Tiling Size Based on Neural Network[J].Journal of Computer Science,2020:47(8):62-70.
[27]KAMIL S,HUSBANDS P,OLIKER L,et al.Impact of modern memory subsystems on cache optimizations for stencil computations[C]//Proceedings of the 2005 Workshop on Memory System Performance.2005:36-43.
[28]BORCHANI H,VARANDO G,BIELZA C,et al.A survey on multi-output regression[J].Wiley Interdisciplinary Reviews:Data Mining and Knowledge Discovery,2015,5(5):216-233.
[1] 冷典典, 杜鹏, 陈建廷, 向阳.
面向自动化集装箱码头的AGV行驶时间估计
Automated Container Terminal Oriented Travel Time Estimation of AGV
计算机科学, 2022, 49(9): 208-214. https://doi.org/10.11896/jsjkx.210700028
[2] 宁晗阳, 马苗, 杨波, 刘士昌.
密码学智能化研究进展与分析
Research Progress and Analysis on Intelligent Cryptology
计算机科学, 2022, 49(9): 288-296. https://doi.org/10.11896/jsjkx.220300053
[3] 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇.
基于大数据的进化网络影响力分析研究综述
Survey of Influence Analysis of Evolutionary Network Based on Big Data
计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[4] 李瑶, 李涛, 李埼钒, 梁家瑞, Ibegbu Nnamdi JULIAN, 陈俊杰, 郭浩.
基于多尺度的稀疏脑功能超网络构建及多特征融合分类研究
Construction and Multi-feature Fusion Classification Research Based on Multi-scale Sparse Brain Functional Hyper-network
计算机科学, 2022, 49(8): 257-266. https://doi.org/10.11896/jsjkx.210600094
[5] 张光华, 高天娇, 陈振国, 于乃文.
基于N-Gram静态分析技术的恶意软件分类研究
Study on Malware Classification Based on N-Gram Static Analysis Technology
计算机科学, 2022, 49(8): 336-343. https://doi.org/10.11896/jsjkx.210900203
[6] 陈明鑫, 张钧波, 李天瑞.
联邦学习攻防研究综述
Survey on Attacks and Defenses in Federated Learning
计算机科学, 2022, 49(7): 310-323. https://doi.org/10.11896/jsjkx.211000079
[7] 李亚茹, 张宇来, 王佳晨.
面向超参数估计的贝叶斯优化方法综述
Survey on Bayesian Optimization Methods for Hyper-parameter Tuning
计算机科学, 2022, 49(6A): 86-92. https://doi.org/10.11896/jsjkx.210300208
[8] 赵璐, 袁立明, 郝琨.
多示例学习算法综述
Review of Multi-instance Learning Algorithms
计算机科学, 2022, 49(6A): 93-99. https://doi.org/10.11896/jsjkx.210500047
[9] 肖治鸿, 韩晔彤, 邹永攀.
基于多源数据和逻辑推理的行为识别技术研究
Study on Activity Recognition Based on Multi-source Data and Logical Reasoning
计算机科学, 2022, 49(6A): 397-406. https://doi.org/10.11896/jsjkx.210300270
[10] 姚烨, 朱怡安, 钱亮, 贾耀, 张黎翔, 刘瑞亮.
一种基于异质模型融合的 Android 终端恶意软件检测方法
Android Malware Detection Method Based on Heterogeneous Model Fusion
计算机科学, 2022, 49(6A): 508-515. https://doi.org/10.11896/jsjkx.210700103
[11] 王飞, 黄涛, 杨晔.
基于Stacking多模型融合的IGBT器件寿命的机器学习预测算法研究
Study on Machine Learning Algorithms for Life Prediction of IGBT Devices Based on Stacking Multi-model Fusion
计算机科学, 2022, 49(6A): 784-789. https://doi.org/10.11896/jsjkx.210400030
[12] 许杰, 祝玉坤, 邢春晓.
机器学习在金融资产定价中的应用研究综述
Application of Machine Learning in Financial Asset Pricing:A Review
计算机科学, 2022, 49(6): 276-286. https://doi.org/10.11896/jsjkx.210900127
[13] 纪璎芮, 袁良, 张云泉.
红黑Gauss-Seidel Stencil并行性和局部性优化
Parallelization and Locality Optimization for Red-Black Gauss-Seidel Stencil
计算机科学, 2022, 49(5): 363-370. https://doi.org/10.11896/jsjkx.220100119
[14] 么晓明, 丁世昌, 赵涛, 黄宏, 罗家德, 傅晓明.
大数据驱动的社会经济地位分析研究综述
Big Data-driven Based Socioeconomic Status Analysis:A Survey
计算机科学, 2022, 49(4): 80-87. https://doi.org/10.11896/jsjkx.211100014
[15] 李野, 陈松灿.
基于物理信息的神经网络:最新进展与展望
Physics-informed Neural Networks:Recent Advances and Prospects
计算机科学, 2022, 49(4): 254-262. https://doi.org/10.11896/jsjkx.210500158
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!