计算机科学 ›› 2020, Vol. 47 ›› Issue (8): 62-70. doi: 10.11896/jsjkx.191200180

• 高性能计算 •

基于神经网络的循环分块大小预测

池昊宇, 陈长波   

  1. 中国科学院大学重庆学院 重庆 400714
    中国科学院大学计算机科学与技术学院 北京 100049
  • 出版日期:2020-08-15 发布日期:2020-08-10
  • 通讯作者: 陈长波(chenchangbo@cigit.ac.cn)
  • 作者简介:chihaoyu@cigit.ac.cn
  • 基金资助:
    国家自然科学基金面上项目(11771421, 11671377, 61572024);中国科学院“西部之光”;重庆市院士牵头科技创新引导专项(cstc2018jcyj-yszxX0002)

Prediction of Loop Tiling Size Based on Neural Network

CHI Hao-yu, CHEN Chang-bo   

  1. Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China
    School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Online:2020-08-15 Published:2020-08-10
  • About author:CHI Hao-yu, born in 1996, postgraduate. His main research interests include compiler optimization and machine learning.
    CHEN Chang-bo, born in 1981, Ph.D, associate professor. His main research interests include symbolic-numeric computation, automatic parallelization and optimization of computer programs, and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (11771421, 11671377, 61572024), the Chinese Academy of Sciences “Light of the West” Program, and the Chongqing Academician-led Science and Technology Innovation Guidance Project (cstc2018jcyj-yszxX0002).

摘要: 循环程序的优化一直是程序优化的重点, 循环分块作为一种典型的循环程序优化技术已被广泛地研究和应用。分块大小的选择对循环程序的性能有着重要影响, 分块大小的选择复杂多变且高度依赖程序和硬件。传统的静态分析和启发式经验搜索的人工和时间成本过高, 缺少通用性和可移植性。为此, 考虑使用有良好高维表示特性的神经网络方法来学习程序与硬件复杂交互过程中分块大小与程序性能的隐含关联。从问题规模、循环结构、循环内操作的局部性等方面抽取出一组新的29维特征, 对问题规模为1024~2048的随机大小的6类内核程序(3维循环、2维数据)的数十万行示例进行实验。串行模型(TSS-T6)相比GCC-O2默认优化实现了6.64倍的平均加速比, 相比穷尽搜索实现了98.5%的平均最大可用性能, 相比Pluto默认分块优化实现了平均9.9%的性能提升。并行模型(TSSP-T6-Search)相比OpenMP默认优化实现了2.41倍的平均加速比, 相比穷尽搜索实现了91.7%的平均最大可用性能, 同时与Pluto默认分块并行优化相比得到了平均9%的性能提升。

关键词: 编译优化, 自动调优, 循环程序分块, 人工神经网络, 缓存优化

Abstract: Optimization of loop programs is an important topic in program optimization. As a typical loop optimization technique, loop tiling has been widely studied and applied. The selection of the tile size, which is complex and highly dependent on both the program and the hardware, has a great impact on the performance of loops. Traditional approaches based on static analysis and heuristic empirical search incur high labor and time costs and lack generality and portability. Therefore, a neural network method, which is good at representing high-dimensional information, is adopted to learn the hidden correlation between tile size and program performance that arises from the complex interaction between program and hardware. A new set of 29-dimensional features is extracted from the problem size, the loop structure, and the locality of operations within the loop. Experiments are carried out on hundreds of thousands of samples of six kinds of kernel programs (3D loops, 2D data) with random problem sizes between 1024 and 2048. The sequential model (TSS-T6) achieves an average speedup of 6.64 over the GCC -O2 default optimization, 98.5% of the average maximum available performance relative to exhaustive search, and an average 9.9% performance improvement over Pluto's default tiling optimization. The parallel model (TSSP-T6-Search) achieves an average speedup of 2.41 over the OpenMP default optimization, 91.7% of the average maximum available performance relative to exhaustive search, and an average 9% performance improvement over Pluto's default parallel tiling optimization.
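To make the role of the tile size concrete, the following minimal C sketch shows a tiled version of a matmul-style kernel with the "3D loop, 2D data" shape used in the experiments, parallelized over tiles with OpenMP. The kernel, the pragma, and the parameter names TI/TJ/TK are illustrative assumptions, not the paper's code; in the approach described above, the (TI, TJ, TK) triple would be supplied per program and platform by the trained predictor (TSS-T6 for the sequential case, TSSP-T6-Search for the parallel case).

    /* Illustrative sketch only: tiling a 3D loop nest over 2D data.
     * The tile sizes TI, TJ, TK are runtime parameters here so that a
     * predicted triple can be plugged in; they are not fixed constants. */
    #include <stddef.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void kernel_tiled(size_t n, const double *A, const double *B, double *C,
                      size_t TI, size_t TJ, size_t TK)
    {
        /* Outer loops enumerate tiles; inner loops stay inside one tile, so the
         * touched blocks of A, B and C are small enough to be reused in cache. */
        #pragma omp parallel for collapse(2) schedule(static)
        for (size_t ii = 0; ii < n; ii += TI)
            for (size_t jj = 0; jj < n; jj += TJ)
                for (size_t kk = 0; kk < n; kk += TK)
                    for (size_t i = ii; i < MIN(ii + TI, n); ++i)
                        for (size_t j = jj; j < MIN(jj + TJ, n); ++j) {
                            double acc = C[i * n + j];   /* partial sum so far */
                            for (size_t k = kk; k < MIN(kk + TK, n); ++k)
                                acc += A[i * n + k] * B[k * n + j];
                            C[i * n + j] = acc;
                        }
    }

Compiled without OpenMP, the same code runs sequentially, which roughly corresponds to the sequential setting; distinct (ii, jj) tiles write disjoint blocks of C, so the parallelization over tiles is race-free.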

Key words: Compilation optimization, Automatic tuning, Loop tiling, Artificial neural network, Cache optimization

中图分类号: TP314
[1] KASPERSKY K.Code Optimization:Effective Memory Usage[M].A-List Publishing, 2003.
[2] ASHOURI A H, PALERMO G, CAVAZOS J, et al.Automatic Tuning of Compilers Using Machine Learning[M].Springer International Publishing, 2018.
[3] WANG Z, O’BOYLE M.Machine learning in compiler optimization[J].Proceedings of the IEEE, 2018, 106(11):1879-1901.
[4] ASHOURI A H, KILLIAN W, CAVAZOS J, et al.A Survey on Compiler Autotuning Using Machine Learning[J].ACM Computing Surveys, 2018, 51(5):1-42.
[5] LIU S, WU W G, ZHAO B, et al.Loop tiling technology for local and parallel optimization[J].Journal of Computer Research and Development, 2015, 52(5):1160-1176.
[6] YUKI T, RENGANARAYANAN L, RAJOPADHYE S, et al.Automatic creation of tile size selection models[C]∥Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization.ACM, 2010:190-199.
[7] KISUKI T, KNIJNENBURG P M W, O'BOYLE M F P.Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation[C]∥Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques.Piscataway, NJ:IEEE, 2000:237-246.
[8] STEPHENSON M, AMARASINGHE S P.Predicting unroll factors using supervised classification[C]∥International Symposium on Code Generation & Optimization.IEEE, 2005.
[9] CUMMINS C, PETOUMENOS P, WANG Z, et al.End-to-end deep learning of optimization heuristics[C]∥2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).IEEE, 2017:219-232.
[10] ASHOURI A H, MARIANI G, PALERMO G, et al.COBAYN:Compiler autotuning framework using Bayesian networks[J].ACM Transactions on Architecture and Code Optimization (TACO), 2016, 13(2):1-25.
[11] WANG Z, O’BOYLE M F P.Mapping parallelism to multi-cores:a machine learning based approach[C]∥ACM Sigplan Notices.ACM, 2009:75-84.
[12] CASTRO M, GOES L F W, RIBEIRO C P, et al.A machine learning-based approach for thread mapping on transactional memory applications[C]∥2011 18th International Conference on High Performance Computing.IEEE, 2011:1-10.
[13] KIM D G, RENGANARAYANAN L, ROSTRON D, et al.Multi-level tiling:M for the price of one[C]∥Proceedings of the 2007 ACM/IEEE conference on Supercomputing(SC’ 07).IEEE, 2007:1-12.
[14] MEHTA S, BEERAKA G, YEW P C.Tile size selection revisited[J].ACM Transactions on Architecture and Code Optimization (TACO), 2013, 10(4):1-27.
[15] MEHTA S, GARG R, TRIVEDI N, et al.Turbotiling:Leveraging prefetching to boost performance of tiled codes[C]∥Proceedings of the 2016 International Conference on Supercomputing.ACM, 2016:1-12.
[16] CHAME J, MOON S.A tile selection algorithm for data locality and cache interference[C]∥Proceedings of the 13th International Conference on Supercomputing.ACM, 1999:492-499.
[17] COLEMAN S, MCKINLEY K S.Tile size selection using cache organization and data layout[J].ACM SIGPLAN Notices, 1995, 30(6):279-290.
[18] YOTOV K, PINGALI K, STODGHILL P.Think globally, search locally[C]∥Proceedings of the 19th annual international conference on Supercomputing.ACM, 2005:141-150.
[19] WHALEY R C, DONGARRA J J.Automatically tuned linear algebra software[C]∥Proceedings of the 1998 ACM/IEEE Conference on Supercomputing(SC’98).IEEE, 1998:38-38.
[20] WHALEY R C, PETITET A, DONGARRA J J.Automated empirical optimizations of software and the ATLAS project[J].Parallel Computing, 2001, 27(1/2):3-35.
[21] SHIRAKO J, SHARMA K, FAUZIA N, et al.Analytical bounds for optimal tile size selection[C]∥International Conference on Compiler Construction.Springer, Berlin, Heidelberg, 2012:101-121.
[22] FRAGUELA B B, CARMUEJA M G, ANDRADE D.Optimal Tile Size Selection Guided by Analytical Models[C]∥Parallel Computing, 2005:565-572.
[23] LIU S, CUI Y, JIANG Q, et al.An efficient tile size selection model based on machine learning[J].Journal of Parallel and Distributed Computing, 2018, 121:27-41.
[24] RAHMAN M, POUCHET L N, SADAYAPPAN P.Neural network assisted tile size selection[C]∥International Workshop on Automatic Performance Tuning.Berkeley, CA:Springer-Verlag, 2010.
[25] MALIK A M.Optimal tile size selection problem using machine learning[C]∥2012 11th International Conference on Machine Learning and Applications.IEEE, 2012, 2:275-280.
[26] GOODFELLOW I, BENGIO Y, COURVILLE A.Deep Learning (Adaptive Computation and Machine Learning series)[M].Cambridge, MA:MIT Press, 2016.
[27] BONDHUGULA U, HARTONO A, RAMANUJAM J, et al.A practical automatic polyhedral parallelizer and locality optimizer[C]∥Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation.ACM, 2008:101-113.
[28] SMITH A J.Cache Memories[J].ACM Computing Surveys, 1982, 14(3):473-530.
[29] LEVINTHAL D.Performance analysis guide for Intel Core i7 processor and Intel Xeon 5500 processors[J].Intel Performance Analysis Guide, 2009, 30:18.
[30] PATTERSON D A, HENNESSY J L, GOLDBERG D.Computer Architecture:A Quantitative Approach[M].Morgan Kaufmann Publishers Inc., 2008:153-162.
[31] POUCHET L N.Polybench:The polyhedral benchmark suite[J/OL].http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[32] IPEK E, MCKEE S A, SINGH K, et al.Efficient architectural design space exploration via predictive modeling[J].ACM Transactions on Architecture and Code Optimization (TACO), 2008, 4(4):1-34.
[33] LEATHER H, BONILLA E, O’BOYLE M.Automatic feature generation for machine learning based optimizing compilation[C]∥Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization.IEEE Computer Society, 2009:81-91.
[34] HINTON G E, SRIVASTAVA N, KRIZHEVSKY A, et al.Improving neural networks by preventing co-adaptation of feature detectors[J].arXiv:1207.0580, 2012.
[35] LIU F, LI G, HU X, et al.Research progress of program understanding based on deep learning[J].Journal of Computer Research and Development, 2019, 56(8):1605-1620.
[36] JIN Z, LIU F, LI G.Program understanding:present and future[J].Journal of Software, 2019, 30(1):110-126.