Computer Science ›› 2021, Vol. 48 ›› Issue (12): 49-58.doi: 10.11896/jsjkx.210200071

• Computer Architecture • Previous Articles     Next Articles

Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation

HU Wei-fang1,2, CHEN Yun1,2, LI Ying-ying3, SHANG Jian-dong1   

  1. 1 Supercomputing Center of Henan Province,Zhengzhou 450000,China
    2 School of Information Engineering,Zhengzhou University,Zhengzhou 450000,China
    3 PLA Strategic Support Force Information Engineering University,Zhengzhou 450000,China
  • Received:2021-02-07 Revised:2021-05-20 Online:2021-12-15 Published:2021-11-26
  • About author:HU Wei-fang,born in 1995,master,is a student member of China Computer Federation.His main research interests include advanced compiler technology and high performance computing.
    SHANG Jian-dong,born in 1968,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include data mining and high performance computing.

Abstract: Existing polyhedral compilation tools often use some simple heuristic strategies to find the optimal loop fusion decisions.It is necessary to manually adjust the loop fusion strategy to get the best performance for different programs.To solve this problem,a fusion strategy based on data reuse analysis is proposed for multi-core CPU platform.This strategy avoids unnecessary fusion constraints that affecting the mining of data locality.For different stages of scheduling,the parallelism constraint for diffe-rent parallel levels is proposed.And a tiling constraint for CPU cache optimization is proposed for statements with complex array accesses.Compared with the previous loop fuion strategies,this strategy takes into account the changes in spatial locality when calculating the fusion profits.This strategy is implemented based on the polyhedral compilation module Polly in the LLVM compilation framework,and some test cases in test suites such as Polybench are selected for testing.In the case of single-core testing,compared with the existing fusion strategies,the average performance is improved by 14.9%~62.5%.In the case of multi-core testing,compared with the existing fusion strategies,the average performance is improved by 19.7%~94.9%,and the speedup is up to 1.49x~3.07x.

Key words: Compiler optimization, Data locality, LLVM, Loop fusion, Polyhedral model

CLC Number: 

  • TP314
[1]RAGAN-KELLEY J,ADAMS A,SHARLET D,et al.Halide:Decoupling Algorithms from Schedules for High-Performance Image Processing[J].Communications of the ACM,2018,61(1):106-115.
[2]CHEN T Q,MOREAU T.TVM:an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(OSDI'18).USENIX Association,2018:579-594.
[3]ZHAO J,LI Y Y,ZHAO R C.Black Magic of Polyhedral Compilation[J].Journal of Software,2018,29(8):2371-2396.
[4]BONDHUGULA U,HARTONO A,RAMANUJAM J,et al.A practical automatic polyhedral parallelizer and locality optimizer[J].ACM Sigplan Notices,2008,43(6):101-113.
[5]VERDOOLAEGE S,CARLOS JUEGA J,COHEN A,et al.Poly- hedral parallel code generation for CUDA[J].ACM Transactions on Architecture & Code Optimization,2013,9(4):1-23.
[6]POP S,COHEN A,CÉDRIC B,et al.GRAPHITE:Polyhedral analyses and optimizations for GCC[C]//Proceedings of the 2006 GCC Developers Summit.2006:179-197.
[7]POUCHET L N,GRLINGER A,ANDREAS S,et al.Polly-polyhedral optimization in LLVM[C]//International Workshop on Polyhedral Compilation Techniques(IMPACT).2011.
[8]VASILACHE N,ZINENKO O,THEODORIDIS T,et al.The Next 700 Accelerated Layers:From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels,Automatically[J].ACM Transactions on Architecture and Code Optimization,2019,16(4):1-26.
[9]BAGHDADI R,RAY J,ROMDHANE M B,et al.Tiramisu:A Polyhedral Compiler for Expressing Fast and Portable Code[C]//Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization.IEEE Press,2019:193-205.
[10]LATTNER C,AMINI M,BONDHUGULA U,et al.MLIR: Scaling Compiler Infrastructure for Domain Specific Computation[C]//2021 IEEE/ACM International Symposium on Code Generation and Optimization(CGO).ACM,2021.
[11]VERDOOLAEGE S.isl:An Integer Set Library for the Poly- hedral Model[C]//Proceedings of the Third International Congress Conference on Mathematical Software.Springer-Verlag,2010:299-302.
[12]DARTE A.Huard.Loop shifting for loop parallelization[R]. Technical Report RR2000-22:ENS Lyon,2000.
[13]QASEM A,KENNEDY K.Profitable loop fusion and tiling using model-driven empirical search[C]//Proceedings of the 20th Annual International Conference on Supercomputing.Association for Computing Machinery,2006:249-258.
[14]FEAUTRIER P.Some efficient solutions to the affine scheduling problem Part II Multidimensional time[J].International Journal of Parallel Programming,1992,21(6):389-420.
[15]LIM A W,LAM M S.Maximizing parallelism and minimizingsynchronization with affine transforms[C]//Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.Association for Computing Machinery,1997:201-214.
[16]LIM A W,CHEONG G I,LAM M S.An affine partitioning algorithm to maximize parallelism and minimize communication[C]//Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing.ACM Press,1999:228-237.
[17]BONDHUGULA U,BASKARAN M,KRISHNAMOORTHY S,et al.Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model[C]//Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Compiler Construction.Springer-Verlag,2008:132-146.
[18]BONDHUGULA U,GÜNLÜK O,DASH S,et al.A model for fusion and code motion in an automatic parallelizing compiler[C]//Proceedings of the 19th International Conference on Pa-rallel Architectures and Compilation Techniques.Association for Computing Machinery,2010:343-352.
[19]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Loop Transformations:Convexity,Pruning and Optimization[J].ACM Sigplan Notices,2011,46(1):549-562.
[20]LOUIS-NOËL P,BONDHUGULA U,CÉDRIC B,et al.Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society,2010:1-11.
[21]MEHTA S,LIN P H,YEW P C.Revisiting loop fusion in the polyhedral framework[C]//Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.Association for Computing Machinery,2014:233-246.
[22]ZINENKO O,VERDOOLAEGE S,REDDY C,et al.Modeling the Conflicting Demands of Parallelism and Temporal/Spatial Locality in Affine Scheduling[C]//Proceedings of the 27th International Conference on Compiler Construction.Association for Computing Machinery.2018:3-13.
[23]VERDOOLAEGE S,GUELTON S,GROSSER T,et al.Sche- dule trees[J].Geodinamica Acta,2013,25(1/2):86-95.
[24]TOBIAS G,RAMANUJAM J,LOUIS N,et,al.Optimistic delinearization of parametrically sized arrays[C]//Proceedings of the 29th ACM on International Conference on Supercomputing.Association for Computing Machinery,2015:351-360.
[1] LU Hao-song, HU Yong-hua, WANG Shu-ying, ZHOU Xin-lian, LI Hui-xiang. Study on Hybrid Resource Heuristic Loop Unrolling Factor Selection Method Based on Vector DSP [J]. Computer Science, 2022, 49(6A): 777-783.
[2] TANG Zhen, HU Yong-hua, LU Hao-song, WANG Shu-ying. Research on DSP Register Pairs Allocation Algorithm with Weak Assigning Constraints [J]. Computer Science, 2021, 48(6A): 587-595.
[3] CHEN Tao, SHU Hui, XIONG Xiao-bing. Study of Universal Shellcode Generation Technology [J]. Computer Science, 2021, 48(4): 288-294.
[4] HU Hao, SHEN Li, ZHOU Qing-lei and GONG Ling-qin. Node Fusion Optimization Method Based on LLVM Compiler [J]. Computer Science, 2020, 47(6A): 561-566.
[5] WANG Yue-feng and WANG Xi-bo. Design of Local Scheduling Algorithm for Integrated Preemptive Scheduling Policy in Hadoop Cluster Environment [J]. Computer Science, 2017, 44(Z6): 567-570.
[6] ZHANG Qi-liang, ZHANG Yu and ZHOU Kun. CCodeExtractor:Automatic Approach of Function Extraction for C Programs [J]. Computer Science, 2017, 44(4): 16-20.
[7] LI Hang-chen, QIN Xiao-lin and SHEN Yao. Load Balancing Strategy on MapReduce with Locality-aware [J]. Computer Science, 2015, 42(10): 50-56.
[8] GE Hong-mei,XU Chao,CHEN Nian and LIAO Xi-mi. Low Power Optimization Method Oriented to Embedded System’s Bus [J]. Computer Science, 2013, 40(12): 31-36.
[9] CHU Ya,MA Ting-huai and ZHAO Li-cheng. Cloud Computing Resource Scheduling:Policy and Algorithm [J]. Computer Science, 2013, 40(11): 8-13.
[10] GU Yu ,ZHOU Liang , DING Qiu-lin. Research of Three-Queue Scheduling Algorithms Based on Priority [J]. Computer Science, 2011, 38(Z10): 253-256.
[11] TIAN Zu-wei,SUN Guang. Research of Compiler Optimization Technology Based on Predicated Code [J]. Computer Science, 2010, 37(5): 130-133.
[12] TANG Wei, WU Cheng-Yong, ZHANG Zhao-Qing (Institute of Computing Technology,Chinese Academy of Sciences, Beijing 100080). [J]. Computer Science, 2006, 33(4): 250-252.
[13] . [J]. Computer Science, 2006, 33(2): 257-262.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!