计算机科学 ›› 2020, Vol. 47 ›› Issue (8): 62-70. doi: 10.11896/jsjkx.191200180

• 高性能计算 •

基于神经网络的循环分块大小预测

池昊宇, 陈长波   

  1. 中国科学院大学重庆学院 重庆 400714
    中国科学院大学计算机科学与技术学院 北京 100049
  • 出版日期:2020-08-15 发布日期:2020-08-10
  • 通讯作者: 陈长波(chenchangbo@cigit.ac.cn)
  • 作者简介:chihaoyu@cigit.ac.cn
  • 基金资助:
    国家自然科学基金面上项目(11771421, 11671377, 61572024);中国科学院“西部之光”;重庆市院士牵头科技创新引导专项(cstc2018jcyj-yszxX0002)

Prediction of Loop Tiling Size Based on Neural Network

CHI Hao-yu, CHEN Chang-bo   

  1. Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China
    School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Online:2020-08-15 Published:2020-08-10
  • About author:CHI Hao-yu, born in 1996, postgraduate. His main research interests include compiler optimization and machine learning.
    CHEN Chang-bo, born in 1981, Ph.D, associate professor. His main research interests include symbolic-numeric computation, automatic parallelization and optimization of computer programs, and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (11771421, 11671377, 61572024), the Chinese Academy of Sciences “Light of the West” Program, and the Chongqing Academician-led Science and Technology Innovation Guidance Project (cstc2018jcyj-yszxX0002).

摘要: 循环程序的优化一直是程序优化的重点, 循环分块作为一种典型的循环程序优化技术已被广泛地研究和应用。分块大小的选择对循环程序的性能有着重要影响, 分块大小的选择复杂多变且高度依赖程序和硬件。传统的静态分析和启发式经验搜索的人工和时间成本过高, 缺少通用性和可移植性。为此, 考虑使用有良好高维表示特性的神经网络方法来学习程序与硬件复杂交互过程中分块大小与程序性能的隐含关联。从问题规模、循环结构、循环内操作的局部性等方面抽取出一组新的29维特征, 对问题规模为1024~2048的随机大小的6类内核程序(3维循环、2维数据)的数十万行示例进行实验。串行模型(TSS-T6)相比GCC-O2默认优化实现了6.64倍的平均加速比, 相比穷尽搜索实现了98.5%的平均最大可用性能, 相比Pluto默认分块优化实现了平均9.9%的性能提升。并行模型(TSSP-T6-Search)相比OpenMP默认优化实现了2.41倍的平均加速比, 相比穷尽搜索实现了91.7%的平均最大可用性能, 同时与Pluto默认分块并行优化相比得到了平均9%的性能提升。

关键词: 编译优化, 自动调优, 循环程序分块, 人工神经网络, 缓存优化

Abstract: Optimization of loop programs is an important topic in program optimization. As a typical loop optimization technique, loop tiling has been widely studied and applied. The selection of the tile size, which is complex and highly dependent on both the program and the hardware, has a great impact on the performance of loops. Traditional approaches based on static analysis and heuristic empirical search incur high labor and time costs and lack generality and portability. Therefore, a neural network method, which is good at representing high-dimensional information, is adopted to learn the hidden correlation between tile size and program performance that arises from the complex interaction between program and hardware. A new set of 29-dimensional features is extracted from the problem size, the loop structure, and the locality of operations within the loop. Experiments are carried out on hundreds of thousands of samples of six kinds of kernel programs (3D loops, 2D data) with random problem sizes between 1024 and 2048. The sequential model (TSS-T6) achieves an average speedup of 6.64 over the GCC -O2 default optimization, 98.5% of the average maximum available performance relative to exhaustive search, and an average 9.9% performance improvement over Pluto's default tiling optimization. The parallel model (TSSP-T6-Search) achieves an average speedup of 2.41 over the OpenMP default optimization, 91.7% of the average maximum available performance relative to exhaustive search, and an average 9% performance improvement over Pluto's default parallel tiling optimization.
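To make the role of the tile size concrete, the following minimal C sketch shows a tiled version of a matmul-style kernel with the "3D loop, 2D data" shape used in the experiments, parallelized over tiles with OpenMP. The kernel, the pragma, and the parameter names TI/TJ/TK are illustrative assumptions, not the paper's code; in the approach described above, the (TI, TJ, TK) triple would be supplied per program and platform by the trained predictor (TSS-T6 for the sequential case, TSSP-T6-Search for the parallel case).

    /* Illustrative sketch only: tiling a 3D loop nest over 2D data.
     * The tile sizes TI, TJ, TK are runtime parameters here so that a
     * predicted triple can be plugged in; they are not fixed constants. */
    #include <stddef.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void kernel_tiled(size_t n, const double *A, const double *B, double *C,
                      size_t TI, size_t TJ, size_t TK)
    {
        /* Outer loops enumerate tiles; inner loops stay inside one tile, so the
         * touched blocks of A, B and C are small enough to be reused in cache. */
        #pragma omp parallel for collapse(2) schedule(static)
        for (size_t ii = 0; ii < n; ii += TI)
            for (size_t jj = 0; jj < n; jj += TJ)
                for (size_t kk = 0; kk < n; kk += TK)
                    for (size_t i = ii; i < MIN(ii + TI, n); ++i)
                        for (size_t j = jj; j < MIN(jj + TJ, n); ++j) {
                            double acc = C[i * n + j];   /* partial sum so far */
                            for (size_t k = kk; k < MIN(kk + TK, n); ++k)
                                acc += A[i * n + k] * B[k * n + j];
                            C[i * n + j] = acc;
                        }
    }

Compiled without OpenMP, the same code runs sequentially, which roughly corresponds to the sequential setting; distinct (ii, jj) tiles write disjoint blocks of C, so the parallelization over tiles is race-free.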

Key words: Compilation optimization, Automatic tuning, Loop tiling, Artificial neural network, Cache optimization

中图分类号: TP314
[1] KASPERSKY K.Code Optimization:Effective Memory Usage[M].A-List Publishing, 2003.
[2] ASHOURI A H, PALERMO G, CAVAZOS J, et al.Automatic Tuning of Compilers Using Machine Learning[M].Springer International Publishing, 2018.
[3] WANG Z, O’BOYLE M.Machine learning in compiler optimization[J].Proceedings of the IEEE, 2018, 106(11):1879-1901.
[4] ASHOURI A H, KILLIAN W, CAVAZOS J, et al.A Survey on Compiler Autotuning Using Machine Learning[J].ACM Computing Surveys, 2018, 51(5):1-42.
[5] LIU S, WU W G, ZHAO B, et al.Loop tiling technology for local and parallel optimization[J].Journal of Computer Research and Development, 2015, 52(5):1160-1176.
[6] YUKI T, RENGANARAYANAN L, RAJOPADHYE S, et al.Automatic creation of tile size selection models[C]∥Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization.ACM, 2010:190-199.
[7] KISUKI T, KNIJNENBURG P M W, O'BOYLE M F P.Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation[C]∥Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques.Piscataway, NJ:IEEE, 2000:237-246.
[8] STEPHENSON M, AMARASINGHE S P.Predicting unroll factors using supervised classification[C]∥International Symposium on Code Generation & Optimization.IEEE, 2005.
[9] CUMMINS C, PETOUMENOS P, WANG Z, et al.End-to-end deep learning of optimization heuristics[C]∥2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).IEEE, 2017:219-232.
[10] ASHOURI A H, MARIANI G, PALERMO G, et al.COBAYN:Compiler autotuning framework using Bayesian networks[J].ACM Transactions on Architecture and Code Optimization (TACO), 2016, 13(2):1-25.
[11] WANG Z, O’BOYLE M F P.Mapping parallelism to multi-cores:a machine learning based approach[C]∥ACM Sigplan Notices.ACM, 2009:75-84.
[12] CASTRO M, GOES L F W, RIBEIRO C P, et al.A machine learning-based approach for thread mapping on transactional memory applications[C]∥2011 18th International Conference on High Performance Computing.IEEE, 2011:1-10.
[13] KIM D G, RENGANARAYANAN L, ROSTRON D, et al.Multi-level tiling:M for the price of one[C]∥Proceedings of the 2007 ACM/IEEE conference on Supercomputing(SC’ 07).IEEE, 2007:1-12.
[14] MEHTA S, BEERAKA G, YEW P C.Tile size selection revisited[J].ACM Transactions on Architecture and Code Optimization (TACO), 2013, 10(4):1-27.
[15] MEHTA S, GARG R, TRIVEDI N, et al.Turbotiling:Leveraging prefetching to boost performance of tiled codes[C]∥Proceedings of the 2016 International Conference on Supercomputing.ACM, 2016:1-12.
[16] CHAME J, MOON S.A tile selection algorithm for data locality and cache interference[C]∥Proceedings of the 13th International Conference on Supercomputing.ACM, 1999:492-499.
[17] COLEMAN S, MCKINLEY K S.Tile size selection using cache organization and data layout[J].ACM SIGPLAN Notices, 1995, 30(6):279-290.
[18] YOTOV K, PINGALI K, STODGHILL P.Think globally, search locally[C]∥Proceedings of the 19th annual international conference on Supercomputing.ACM, 2005:141-150.
[19] WHALEY R C, DONGARRA J J.Automatically tuned linear algebra software[C]∥Proceedings of the 1998 ACM/IEEE Conference on Supercomputing(SC’98).IEEE, 1998:38-38.
[20] WHALEY R C, PETITET A, DONGARRA J J.Automated empirical optimizations of software and the ATLAS project[J].Parallel Computing, 2001, 27(1/2):3-35.
[21] SHIRAKO J, SHARMA K, FAUZIA N, et al.Analytical bounds for optimal tile size selection[C]∥International Conference on Compiler Construction.Springer, Berlin, Heidelberg, 2012:101-121.
[22] FRAGUELA B B, CARMUEJA M G, ANDRADE D.Optimal Tile Size Selection Guided by Analytical Models[C]∥Parallel Computing, 2005:565-572.
[23] LIU S, CUI Y, JIANG Q, et al.An efficient tile size selection model based on machine learning[J].Journal of Parallel and Distributed Computing, 2018, 121:27-41.
[24] RAHMAN M, POUCHET L N, SADAYAPPAN P.Neural network assisted tile size selection[C]∥International Workshop on Automatic Performance Tuning.Berkeley, CA:Springer-Verlag, 2010.
[25] MALIK A M.Optimal tile size selection problem using machine learning[C]∥2012 11th International Conference on Machine Learning and Applications.IEEE, 2012, 2:275-280.
[26] GOODFELLOW I, BENGIO Y, COURVILLE A.Deep Learning (Adaptive Computation and Machine Learning series)[M].Cambridge, MA:MIT Press, 2016.
[27] BONDHUGULA U, HARTONO A, RAMANUJAM J, et al.A practical automatic polyhedral parallelizer and locality optimizer[C]∥Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation.ACM, 2008:101-113.
[28] SMITH A J.Cache Memories[J].ACM Computing Surveys, 1982, 14(3):473-530.
[29] LEVINTHAL D.Performance analysis guide for Intel Core i7 processor and Intel Xeon 5500 processors[J].Intel Performance Analysis Guide, 2009, 30:18.
[30] PATTERSON D A, HENNESSY J L, GOLDBERG D.Computer Architecture:A Quantitative Approach[M].Morgan Kaufmann Publishers Inc., 2008:153-162.
[31] POUCHET L N.Polybench:The polyhedral benchmark suite[J/OL].http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[32] IPEK E, MCKEE S A, SINGH K, et al.Efficient architectural design space exploration via predictive modeling[J].ACM Transactions on Architecture and Code Optimization (TACO), 2008, 4(4):1-34.
[33] LEATHER H, BONILLA E, O’BOYLE M.Automatic feature generation for machine learning based optimizing compilation[C]∥Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization.IEEE Computer Society, 2009:81-91.
[34] HINTON G E, SRIVASTAVA N, KRIZHEVSKY A, et al.Improving neural networks by preventing co-adaptation of feature detectors[J].arXiv:1207.0580, 2012.
[35] LIU F, LI G, HU X, et al.Research progress of program understanding based on deep learning[J].Journal of Computer Research and Development, 2019, 56(8):1605-1620.
[36] JIN Z, LIU F, LI G.Program understanding:present and future[J].Journal of Software, 2019, 30(1):110-126.