计算机科学 ›› 2021, Vol. 48 ›› Issue (12): 49-58.doi: 10.11896/jsjkx.210200071
胡伟方1,2, 陈云1,2, 李颖颖3, 商建东1
HU Wei-fang1,2, CHEN Yun1,2, LI Ying-ying3, SHANG Jian-dong1
摘要: 现有多面体编译工具往往使用一些简单的启发式策略来寻找最优的语句合并,对于不同的待优化程序,需要手工调整循环合并策略以获得最佳性能。针对这一问题,面向多核CPU目标平台,文中提出了一种基于数据重用分析的循环合并策略。该策略避免了不必要的且会影响数据局部性利用的合并限制:针对调度的不同阶段,提出了面向不同并行层次的并行性合并限制;对于数组访问关系较为复杂的语句,提出了面向CPU高速缓存优化的分块性合并限制。相较于以往的合并策略,该策略在计算合并收益时考虑到了空间局部性的变化。文中基于LLVM编译框架中的多面体编译模块Polly实现了这一策略,并选用Polybench等测试套件中的部分测试用例进行测试。实验结果表明,相较于现有的多种合并策略,在单核执行情况下,测试用例平均获得了14.9%~62.5%的性能提升;在多核执行情况下,多个测试用例平均获得了19.7%~94.9%的性能提升,在单个测试用例中最高获得了1.49x~3.07x的加速效果。
中图分类号:
[1]RAGAN-KELLEY J,ADAMS A,SHARLET D,et al.Halide:Decoupling Algorithms from Schedules for High-Performance Image Processing[J].Communications of the ACM,2018,61(1):106-115. [2]CHEN T Q,MOREAU T.TVM:an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(OSDI'18).USENIX Association,2018:579-594. [3]ZHAO J,LI Y Y,ZHAO R C.Black Magic of Polyhedral Compilation[J].Journal of Software,2018,29(8):2371-2396. [4]BONDHUGULA U,HARTONO A,RAMANUJAM J,et al.A practical automatic polyhedral parallelizer and locality optimizer[J].ACM Sigplan Notices,2008,43(6):101-113. [5]VERDOOLAEGE S,CARLOS JUEGA J,COHEN A,et al.Poly- hedral parallel code generation for CUDA[J].ACM Transactions on Architecture & Code Optimization,2013,9(4):1-23. [6]POP S,COHEN A,CÉDRIC B,et al.GRAPHITE:Polyhedral analyses and optimizations for GCC[C]//Proceedings of the 2006 GCC Developers Summit.2006:179-197. [7]POUCHET L N,GRLINGER A,ANDREAS S,et al.Polly-polyhedral optimization in LLVM[C]//International Workshop on Polyhedral Compilation Techniques(IMPACT).2011. [8]VASILACHE N,ZINENKO O,THEODORIDIS T,et al.The Next 700 Accelerated Layers:From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels,Automatically[J].ACM Transactions on Architecture and Code Optimization,2019,16(4):1-26. [9]BAGHDADI R,RAY J,ROMDHANE M B,et al.Tiramisu:A Polyhedral Compiler for Expressing Fast and Portable Code[C]//Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization.IEEE Press,2019:193-205. [10]LATTNER C,AMINI M,BONDHUGULA U,et al.MLIR: Scaling Compiler Infrastructure for Domain Specific Computation[C]//2021 IEEE/ACM International Symposium on Code Generation and Optimization(CGO).ACM,2021. [11]VERDOOLAEGE S.isl:An Integer Set Library for the Poly- hedral Model[C]//Proceedings of the Third International Congress Conference on Mathematical Software.Springer-Verlag,2010:299-302. [12]DARTE A.Huard.Loop shifting for loop parallelization[R]. Technical Report RR2000-22:ENS Lyon,2000. [13]QASEM A,KENNEDY K.Profitable loop fusion and tiling using model-driven empirical search[C]//Proceedings of the 20th Annual International Conference on Supercomputing.Association for Computing Machinery,2006:249-258. [14]FEAUTRIER P.Some efficient solutions to the affine scheduling problem Part II Multidimensional time[J].International Journal of Parallel Programming,1992,21(6):389-420. [15]LIM A W,LAM M S.Maximizing parallelism and minimizingsynchronization with affine transforms[C]//Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.Association for Computing Machinery,1997:201-214. [16]LIM A W,CHEONG G I,LAM M S.An affine partitioning algorithm to maximize parallelism and minimize communication[C]//Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing.ACM Press,1999:228-237. [17]BONDHUGULA U,BASKARAN M,KRISHNAMOORTHY S,et al.Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model[C]//Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Compiler Construction.Springer-Verlag,2008:132-146. [18]BONDHUGULA U,GÜNLÜK O,DASH S,et al.A model for fusion and code motion in an automatic parallelizing compiler[C]//Proceedings of the 19th International Conference on Pa-rallel Architectures and Compilation Techniques.Association for Computing Machinery,2010:343-352. [19]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Loop Transformations:Convexity,Pruning and Optimization[J].ACM Sigplan Notices,2011,46(1):549-562. [20]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society,2010:1-11. [21]MEHTA S,LIN P H,YEW P C.Revisiting loop fusion in the polyhedral framework[C]//Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.Association for Computing Machinery,2014:233-246. [22]ZINENKO O,VERDOOLAEGE S,REDDY C,et al.Modeling the Conflicting Demands of Parallelism and Temporal/Spatial Locality in Affine Scheduling[C]//Proceedings of the 27th International Conference on Compiler Construction.Association for Computing Machinery.2018:3-13. [23]VERDOOLAEGE S,GUELTON S,GROSSER T,et al.Sche- dule trees[J].Geodinamica Acta,2013,25(1/2):86-95. [24]TOBIAS G,RAMANUJAM J,LOUIS N,et,al.Optimistic delinearization of parametrically sized arrays[C]//Proceedings of the 29th ACM on International Conference on Supercomputing.Association for Computing Machinery,2015:351-360. |
[1] | 陆浩松, 胡勇华, 王书盈, 周新莲, 李慧祥. 向量DSP的混合资源启发式循环展开因子选择方法研究 Study on Hybrid Resource Heuristic Loop Unrolling Factor Selection Method Based on Vector DSP 计算机科学, 2022, 49(6A): 777-783. https://doi.org/10.11896/jsjkx.210400146 |
[2] | 唐镇, 胡勇华, 陆浩松, 王书盈. 基于弱约束指派的DSP寄存器偶对分配算法研究 Research on DSP Register Pairs Allocation Algorithm with Weak Assigning Constraints 计算机科学, 2021, 48(6A): 587-595. https://doi.org/10.11896/jsjkx.200600061 |
[3] | 池昊宇, 陈长波. 基于神经网络的循环分块大小预测 Prediction of Loop Tiling Size Based on Neural Network 计算机科学, 2020, 47(8): 62-70. https://doi.org/10.11896/jsjkx.191200180 |
[4] | 葛红美,徐超,陈念,廖希密. 一种面向嵌入式系统总线的低功耗优化方法 Low Power Optimization Method Oriented to Embedded System’s Bus 计算机科学, 2013, 40(12): 31-36. |
[5] | 李丽英,唐卓,李仁发. 基于LATE的Hadoop数据局部性改进调度算法 New Improvement of the Hadoop Relevant Data Locality Scheduling Algorithm Based on LATE 计算机科学, 2011, 38(11): 67-70. |
[6] | 田祖伟,孙光. 基于谓词代码的编译优化技术研究 Research of Compiler Optimization Technology Based on Predicated Code 计算机科学, 2010, 37(5): 130-133. |
[7] | . 基于汇编代码的指令调度器的设计与实现 计算机科学, 2009, 36(3): 45-47. |
[8] | . 数据流分析的关键技术研究 计算机科学, 2005, 32(12): 91-93. |
[9] | 连瑞琦 张兆庆. 低功耗编译的若干相关技术 计算机科学, 2004, 31(8): 164-167. |
[10] | 舒辉 康绯. CME分析中的丢番图方程求解 计算机科学, 2002, 29(7): 149-151. |
|