计算机科学 ›› 2021, Vol. 48 ›› Issue (12): 49-58.doi: 10.11896/jsjkx.210200071

• 计算机体系结构* 上一篇    下一篇

基于数据重用分析的多面体循环合并策略

胡伟方1,2, 陈云1,2, 李颖颖3, 商建东1   

  1. 1 河南省超级计算中心 郑州450000
    2 郑州大学信息工程学院 郑州450000
    3 战略支援部队信息工程大学 郑州450000
  • 收稿日期:2021-02-07 修回日期:2021-05-20 出版日期:2021-12-15 发布日期:2021-11-26
  • 通讯作者: 商建东(shangjiandong@zzu.edu.cn)
  • 作者简介:iewfhu@foxmail.com

Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation

HU Wei-fang1,2, CHEN Yun1,2, LI Ying-ying3, SHANG Jian-dong1   

  1. 1 Supercomputing Center of Henan Province,Zhengzhou 450000,China
    2 School of Information Engineering,Zhengzhou University,Zhengzhou 450000,China
    3 PLA Strategic Support Force Information Engineering University,Zhengzhou 450000,China
  • Received:2021-02-07 Revised:2021-05-20 Online:2021-12-15 Published:2021-11-26
  • About author:HU Wei-fang,born in 1995,master,is a student member of China Computer Federation.His main research interests include advanced compiler technology and high performance computing.
    SHANG Jian-dong,born in 1968,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include data mining and high performance computing.

摘要: 现有多面体编译工具往往使用一些简单的启发式策略来寻找最优的语句合并,对于不同的待优化程序,需要手工调整循环合并策略以获得最佳性能。针对这一问题,面向多核CPU目标平台,文中提出了一种基于数据重用分析的循环合并策略。该策略避免了不必要的且会影响数据局部性利用的合并限制:针对调度的不同阶段,提出了面向不同并行层次的并行性合并限制;对于数组访问关系较为复杂的语句,提出了面向CPU高速缓存优化的分块性合并限制。相较于以往的合并策略,该策略在计算合并收益时考虑到了空间局部性的变化。文中基于LLVM编译框架中的多面体编译模块Polly实现了这一策略,并选用Polybench等测试套件中的部分测试用例进行测试。实验结果表明,相较于现有的多种合并策略,在单核执行情况下,测试用例平均获得了14.9%~62.5%的性能提升;在多核执行情况下,多个测试用例平均获得了19.7%~94.9%的性能提升,在单个测试用例中最高获得了1.49x~3.07x的加速效果。

关键词: 编译优化, 多面体模型, LLVM编译框架, 循环合并, 数据局部性

Abstract: Existing polyhedral compilation tools often use some simple heuristic strategies to find the optimal loop fusion decisions.It is necessary to manually adjust the loop fusion strategy to get the best performance for different programs.To solve this problem,a fusion strategy based on data reuse analysis is proposed for multi-core CPU platform.This strategy avoids unnecessary fusion constraints that affecting the mining of data locality.For different stages of scheduling,the parallelism constraint for diffe-rent parallel levels is proposed.And a tiling constraint for CPU cache optimization is proposed for statements with complex array accesses.Compared with the previous loop fuion strategies,this strategy takes into account the changes in spatial locality when calculating the fusion profits.This strategy is implemented based on the polyhedral compilation module Polly in the LLVM compilation framework,and some test cases in test suites such as Polybench are selected for testing.In the case of single-core testing,compared with the existing fusion strategies,the average performance is improved by 14.9%~62.5%.In the case of multi-core testing,compared with the existing fusion strategies,the average performance is improved by 19.7%~94.9%,and the speedup is up to 1.49x~3.07x.

Key words: Compiler optimization, Polyhedral model, LLVM, Loop fusion, Data locality

中图分类号: 

  • TP314
[1]RAGAN-KELLEY J,ADAMS A,SHARLET D,et al.Halide:Decoupling Algorithms from Schedules for High-Performance Image Processing[J].Communications of the ACM,2018,61(1):106-115.
[2]CHEN T Q,MOREAU T.TVM:an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(OSDI'18).USENIX Association,2018:579-594.
[3]ZHAO J,LI Y Y,ZHAO R C.Black Magic of Polyhedral Compilation[J].Journal of Software,2018,29(8):2371-2396.
[4]BONDHUGULA U,HARTONO A,RAMANUJAM J,et al.A practical automatic polyhedral parallelizer and locality optimizer[J].ACM Sigplan Notices,2008,43(6):101-113.
[5]VERDOOLAEGE S,CARLOS JUEGA J,COHEN A,et al.Poly- hedral parallel code generation for CUDA[J].ACM Transactions on Architecture & Code Optimization,2013,9(4):1-23.
[6]POP S,COHEN A,CÉDRIC B,et al.GRAPHITE:Polyhedral analyses and optimizations for GCC[C]//Proceedings of the 2006 GCC Developers Summit.2006:179-197.
[7]POUCHET L N,GRLINGER A,ANDREAS S,et al.Polly-polyhedral optimization in LLVM[C]//International Workshop on Polyhedral Compilation Techniques(IMPACT).2011.
[8]VASILACHE N,ZINENKO O,THEODORIDIS T,et al.The Next 700 Accelerated Layers:From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels,Automatically[J].ACM Transactions on Architecture and Code Optimization,2019,16(4):1-26.
[9]BAGHDADI R,RAY J,ROMDHANE M B,et al.Tiramisu:A Polyhedral Compiler for Expressing Fast and Portable Code[C]//Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization.IEEE Press,2019:193-205.
[10]LATTNER C,AMINI M,BONDHUGULA U,et al.MLIR: Scaling Compiler Infrastructure for Domain Specific Computation[C]//2021 IEEE/ACM International Symposium on Code Generation and Optimization(CGO).ACM,2021.
[11]VERDOOLAEGE S.isl:An Integer Set Library for the Poly- hedral Model[C]//Proceedings of the Third International Congress Conference on Mathematical Software.Springer-Verlag,2010:299-302.
[12]DARTE A.Huard.Loop shifting for loop parallelization[R]. Technical Report RR2000-22:ENS Lyon,2000.
[13]QASEM A,KENNEDY K.Profitable loop fusion and tiling using model-driven empirical search[C]//Proceedings of the 20th Annual International Conference on Supercomputing.Association for Computing Machinery,2006:249-258.
[14]FEAUTRIER P.Some efficient solutions to the affine scheduling problem Part II Multidimensional time[J].International Journal of Parallel Programming,1992,21(6):389-420.
[15]LIM A W,LAM M S.Maximizing parallelism and minimizingsynchronization with affine transforms[C]//Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.Association for Computing Machinery,1997:201-214.
[16]LIM A W,CHEONG G I,LAM M S.An affine partitioning algorithm to maximize parallelism and minimize communication[C]//Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing.ACM Press,1999:228-237.
[17]BONDHUGULA U,BASKARAN M,KRISHNAMOORTHY S,et al.Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model[C]//Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Compiler Construction.Springer-Verlag,2008:132-146.
[18]BONDHUGULA U,GÜNLÜK O,DASH S,et al.A model for fusion and code motion in an automatic parallelizing compiler[C]//Proceedings of the 19th International Conference on Pa-rallel Architectures and Compilation Techniques.Association for Computing Machinery,2010:343-352.
[19]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Loop Transformations:Convexity,Pruning and Optimization[J].ACM Sigplan Notices,2011,46(1):549-562.
[20]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society,2010:1-11.
[21]MEHTA S,LIN P H,YEW P C.Revisiting loop fusion in the polyhedral framework[C]//Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.Association for Computing Machinery,2014:233-246.
[22]ZINENKO O,VERDOOLAEGE S,REDDY C,et al.Modeling the Conflicting Demands of Parallelism and Temporal/Spatial Locality in Affine Scheduling[C]//Proceedings of the 27th International Conference on Compiler Construction.Association for Computing Machinery.2018:3-13.
[23]VERDOOLAEGE S,GUELTON S,GROSSER T,et al.Sche- dule trees[J].Geodinamica Acta,2013,25(1/2):86-95.
[24]TOBIAS G,RAMANUJAM J,LOUIS N,et,al.Optimistic delinearization of parametrically sized arrays[C]//Proceedings of the 29th ACM on International Conference on Supercomputing.Association for Computing Machinery,2015:351-360.
[1] 唐镇, 胡勇华, 陆浩松, 王书盈. 基于弱约束指派的DSP寄存器偶对分配算法研究[J]. 计算机科学, 2021, 48(6A): 587-595.
[2] 池昊宇, 陈长波. 基于神经网络的循环分块大小预测[J]. 计算机科学, 2020, 47(8): 62-70.
[3] 葛红美,徐超,陈念,廖希密. 一种面向嵌入式系统总线的低功耗优化方法[J]. 计算机科学, 2013, 40(12): 31-36.
[4] 李丽英,唐卓,李仁发. 基于LATE的Hadoop数据局部性改进调度算法[J]. 计算机科学, 2011, 38(11): 67-70.
[5] 田祖伟,孙光. 基于谓词代码的编译优化技术研究[J]. 计算机科学, 2010, 37(5): 130-133.
[6] . 基于汇编代码的指令调度器的设计与实现[J]. 计算机科学, 2009, 36(3): 45-47.
[7] . 数据流分析的关键技术研究[J]. 计算机科学, 2005, 32(12): 91-93.
[8] 连瑞琦 张兆庆. 低功耗编译的若干相关技术[J]. 计算机科学, 2004, 31(8): 164-167.
[9] 舒辉 康绯. CME分析中的丢番图方程求解[J]. 计算机科学, 2002, 29(7): 149-151.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 张广梅,李景霞. 利用双向推导检测Java程序中的内存泄漏[J]. 计算机科学, 2014, 41(Z6): 438 -441 .
[2] 李专 王元珍. 基于RDF的XML安全推理控制[J]. 计算机科学, 2007, 34(4): 104 -105 .
[3] 王珺, 朱志伟, 刘俊杰. 一种针对无线传感网中黑洞攻击的检测与防御方法[J]. 计算机科学, 2019, 46(2): 102 -108 .
[4] 潘孝勤, 芦天亮, 杜彦辉, 仝鑫. 基于深度学习的语音合成与转换技术综述[J]. 计算机科学, 2021, 48(8): 200 -208 .
[5] 王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究[J]. 计算机科学, 2021, 48(9): 36 -42 .
[6] 余力, 杜启翰, 岳博妍, 向君瑶, 徐冠宇, 冷友方. 基于强化学习的推荐研究综述[J]. 计算机科学, 2021, 48(10): 1 -18 .
[7] 王梓强, 胡晓光, 李晓筱, 杜卓群. 移动机器人全局路径规划算法综述[J]. 计算机科学, 2021, 48(10): 19 -29 .
[8] 高洪皓, 郑子彬, 殷昱煜, 丁勇. 区块链技术专题序言[J]. 计算机科学, 2021, 48(11): 1 -3 .
[9] 毛瀚宇, 聂铁铮, 申德荣, 于戈, 徐石成, 何光宇. 区块链即服务平台关键技术及发展综述[J]. 计算机科学, 2021, 48(11): 4 -11 .
[10] 张倩, 肖丽. 基于流线的流场可视化绘制方法综述[J]. 计算机科学, 2021, 48(12): 1 -7 .