计算机科学 ›› 2021, Vol. 48 ›› Issue (12): 49-58.doi: 10.11896/jsjkx.210200071

• 计算机体系结构* 上一篇    下一篇

基于数据重用分析的多面体循环合并策略

胡伟方1,2, 陈云1,2, 李颖颖3, 商建东1   

  1. 1 河南省超级计算中心 郑州450000
    2 郑州大学信息工程学院 郑州450000
    3 战略支援部队信息工程大学 郑州450000
  • 收稿日期:2021-02-07 修回日期:2021-05-20 出版日期:2021-12-15 发布日期:2021-11-26
  • 通讯作者: 商建东(shangjiandong@zzu.edu.cn)
  • 作者简介:iewfhu@foxmail.com

Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation

HU Wei-fang1,2, CHEN Yun1,2, LI Ying-ying3, SHANG Jian-dong1   

  1. 1 Supercomputing Center of Henan Province,Zhengzhou 450000,China
    2 School of Information Engineering,Zhengzhou University,Zhengzhou 450000,China
    3 PLA Strategic Support Force Information Engineering University,Zhengzhou 450000,China
  • Received:2021-02-07 Revised:2021-05-20 Online:2021-12-15 Published:2021-11-26
  • About author:HU Wei-fang,born in 1995,master,is a student member of China Computer Federation.His main research interests include advanced compiler technology and high performance computing.
    SHANG Jian-dong,born in 1968,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include data mining and high performance computing.

摘要: 现有多面体编译工具往往使用一些简单的启发式策略来寻找最优的语句合并,对于不同的待优化程序,需要手工调整循环合并策略以获得最佳性能。针对这一问题,面向多核CPU目标平台,文中提出了一种基于数据重用分析的循环合并策略。该策略避免了不必要的且会影响数据局部性利用的合并限制:针对调度的不同阶段,提出了面向不同并行层次的并行性合并限制;对于数组访问关系较为复杂的语句,提出了面向CPU高速缓存优化的分块性合并限制。相较于以往的合并策略,该策略在计算合并收益时考虑到了空间局部性的变化。文中基于LLVM编译框架中的多面体编译模块Polly实现了这一策略,并选用Polybench等测试套件中的部分测试用例进行测试。实验结果表明,相较于现有的多种合并策略,在单核执行情况下,测试用例平均获得了14.9%~62.5%的性能提升;在多核执行情况下,多个测试用例平均获得了19.7%~94.9%的性能提升,在单个测试用例中最高获得了1.49x~3.07x的加速效果。

关键词: LLVM编译框架, 编译优化, 多面体模型, 数据局部性, 循环合并

Abstract: Existing polyhedral compilation tools often use some simple heuristic strategies to find the optimal loop fusion decisions.It is necessary to manually adjust the loop fusion strategy to get the best performance for different programs.To solve this problem,a fusion strategy based on data reuse analysis is proposed for multi-core CPU platform.This strategy avoids unnecessary fusion constraints that affecting the mining of data locality.For different stages of scheduling,the parallelism constraint for diffe-rent parallel levels is proposed.And a tiling constraint for CPU cache optimization is proposed for statements with complex array accesses.Compared with the previous loop fuion strategies,this strategy takes into account the changes in spatial locality when calculating the fusion profits.This strategy is implemented based on the polyhedral compilation module Polly in the LLVM compilation framework,and some test cases in test suites such as Polybench are selected for testing.In the case of single-core testing,compared with the existing fusion strategies,the average performance is improved by 14.9%~62.5%.In the case of multi-core testing,compared with the existing fusion strategies,the average performance is improved by 19.7%~94.9%,and the speedup is up to 1.49x~3.07x.

Key words: Compiler optimization, Data locality, LLVM, Loop fusion, Polyhedral model

中图分类号: 

  • TP314
[1]RAGAN-KELLEY J,ADAMS A,SHARLET D,et al.Halide:Decoupling Algorithms from Schedules for High-Performance Image Processing[J].Communications of the ACM,2018,61(1):106-115.
[2]CHEN T Q,MOREAU T.TVM:an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(OSDI'18).USENIX Association,2018:579-594.
[3]ZHAO J,LI Y Y,ZHAO R C.Black Magic of Polyhedral Compilation[J].Journal of Software,2018,29(8):2371-2396.
[4]BONDHUGULA U,HARTONO A,RAMANUJAM J,et al.A practical automatic polyhedral parallelizer and locality optimizer[J].ACM Sigplan Notices,2008,43(6):101-113.
[5]VERDOOLAEGE S,CARLOS JUEGA J,COHEN A,et al.Poly- hedral parallel code generation for CUDA[J].ACM Transactions on Architecture & Code Optimization,2013,9(4):1-23.
[6]POP S,COHEN A,CÉDRIC B,et al.GRAPHITE:Polyhedral analyses and optimizations for GCC[C]//Proceedings of the 2006 GCC Developers Summit.2006:179-197.
[7]POUCHET L N,GRLINGER A,ANDREAS S,et al.Polly-polyhedral optimization in LLVM[C]//International Workshop on Polyhedral Compilation Techniques(IMPACT).2011.
[8]VASILACHE N,ZINENKO O,THEODORIDIS T,et al.The Next 700 Accelerated Layers:From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels,Automatically[J].ACM Transactions on Architecture and Code Optimization,2019,16(4):1-26.
[9]BAGHDADI R,RAY J,ROMDHANE M B,et al.Tiramisu:A Polyhedral Compiler for Expressing Fast and Portable Code[C]//Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization.IEEE Press,2019:193-205.
[10]LATTNER C,AMINI M,BONDHUGULA U,et al.MLIR: Scaling Compiler Infrastructure for Domain Specific Computation[C]//2021 IEEE/ACM International Symposium on Code Generation and Optimization(CGO).ACM,2021.
[11]VERDOOLAEGE S.isl:An Integer Set Library for the Poly- hedral Model[C]//Proceedings of the Third International Congress Conference on Mathematical Software.Springer-Verlag,2010:299-302.
[12]DARTE A.Huard.Loop shifting for loop parallelization[R]. Technical Report RR2000-22:ENS Lyon,2000.
[13]QASEM A,KENNEDY K.Profitable loop fusion and tiling using model-driven empirical search[C]//Proceedings of the 20th Annual International Conference on Supercomputing.Association for Computing Machinery,2006:249-258.
[14]FEAUTRIER P.Some efficient solutions to the affine scheduling problem Part II Multidimensional time[J].International Journal of Parallel Programming,1992,21(6):389-420.
[15]LIM A W,LAM M S.Maximizing parallelism and minimizingsynchronization with affine transforms[C]//Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.Association for Computing Machinery,1997:201-214.
[16]LIM A W,CHEONG G I,LAM M S.An affine partitioning algorithm to maximize parallelism and minimize communication[C]//Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing.ACM Press,1999:228-237.
[17]BONDHUGULA U,BASKARAN M,KRISHNAMOORTHY S,et al.Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model[C]//Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Compiler Construction.Springer-Verlag,2008:132-146.
[18]BONDHUGULA U,GÜNLÜK O,DASH S,et al.A model for fusion and code motion in an automatic parallelizing compiler[C]//Proceedings of the 19th International Conference on Pa-rallel Architectures and Compilation Techniques.Association for Computing Machinery,2010:343-352.
[19]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Loop Transformations:Convexity,Pruning and Optimization[J].ACM Sigplan Notices,2011,46(1):549-562.
[20]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society,2010:1-11.
[21]MEHTA S,LIN P H,YEW P C.Revisiting loop fusion in the polyhedral framework[C]//Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.Association for Computing Machinery,2014:233-246.
[22]ZINENKO O,VERDOOLAEGE S,REDDY C,et al.Modeling the Conflicting Demands of Parallelism and Temporal/Spatial Locality in Affine Scheduling[C]//Proceedings of the 27th International Conference on Compiler Construction.Association for Computing Machinery.2018:3-13.
[23]VERDOOLAEGE S,GUELTON S,GROSSER T,et al.Sche- dule trees[J].Geodinamica Acta,2013,25(1/2):86-95.
[24]TOBIAS G,RAMANUJAM J,LOUIS N,et,al.Optimistic delinearization of parametrically sized arrays[C]//Proceedings of the 29th ACM on International Conference on Supercomputing.Association for Computing Machinery,2015:351-360.
[1] 陆浩松, 胡勇华, 王书盈, 周新莲, 李慧祥.
向量DSP的混合资源启发式循环展开因子选择方法研究
Study on Hybrid Resource Heuristic Loop Unrolling Factor Selection Method Based on Vector DSP
计算机科学, 2022, 49(6A): 777-783. https://doi.org/10.11896/jsjkx.210400146
[2] 唐镇, 胡勇华, 陆浩松, 王书盈.
基于弱约束指派的DSP寄存器偶对分配算法研究
Research on DSP Register Pairs Allocation Algorithm with Weak Assigning Constraints
计算机科学, 2021, 48(6A): 587-595. https://doi.org/10.11896/jsjkx.200600061
[3] 池昊宇, 陈长波.
基于神经网络的循环分块大小预测
Prediction of Loop Tiling Size Based on Neural Network
计算机科学, 2020, 47(8): 62-70. https://doi.org/10.11896/jsjkx.191200180
[4] 葛红美,徐超,陈念,廖希密.
一种面向嵌入式系统总线的低功耗优化方法
Low Power Optimization Method Oriented to Embedded System’s Bus
计算机科学, 2013, 40(12): 31-36.
[5] 李丽英,唐卓,李仁发.
基于LATE的Hadoop数据局部性改进调度算法
New Improvement of the Hadoop Relevant Data Locality Scheduling Algorithm Based on LATE
计算机科学, 2011, 38(11): 67-70.
[6] 田祖伟,孙光.
基于谓词代码的编译优化技术研究
Research of Compiler Optimization Technology Based on Predicated Code
计算机科学, 2010, 37(5): 130-133.
[7] .
基于汇编代码的指令调度器的设计与实现

计算机科学, 2009, 36(3): 45-47.
[8] .
数据流分析的关键技术研究

计算机科学, 2005, 32(12): 91-93.
[9] 连瑞琦 张兆庆.
低功耗编译的若干相关技术

计算机科学, 2004, 31(8): 164-167.
[10] 舒辉 康绯.
CME分析中的丢番图方程求解

计算机科学, 2002, 29(7): 149-151.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!