计算机科学 ›› 2016, Vol. 43 ›› Issue (1): 226-231.doi: 10.11896/j.issn.1002-137X.2016.01.049

• 软件与数据库技术 • 上一篇    下一篇

循环展开技术在向量程序中的应用

高伟,赵荣彩,于海宁,张庆花   

  1. 信息工程大学 郑州450001 数学工程与先进计算国家重点实验室 无锡214125,信息工程大学 郑州450001 数学工程与先进计算国家重点实验室 无锡214125,信息工程大学 郑州450001 数学工程与先进计算国家重点实验室 无锡214125,信息工程大学 郑州450001 数学工程与先进计算国家重点实验室 无锡214125
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受“核高基”国家科技重大专项(2009ZX01036-001-001-2),数学工程与先进计算国家重点实验室开放课题(2013A11)资助

Loop Unrolling in Vectorized Programs

GAO Wei, ZHAO Rong-cai, YU Hai-ning and ZHANG Qing-hua   

  • Online:2018-12-01 Published:2018-12-01

摘要: 循环展开是一项常用的循环优化技术。当前针对串行程序的循环展开技术已经比较成熟,但是在实际应用中没有针对向量程序进行有效的循环展开。为了解决这个问题,提出了一种面向向量程序的循环展开技术。首先,针对向量寄存器压力和代码膨胀等限制因素,提出了一种自动计算展开因子的CUFVL算法;其次,根据向量循环展开的特点,制定了完全展开策略;最后结合CUFVL算法和完全展开策略,设计了向量循环展开的总体流程。实验结果表明,该方案能够计算出合适的展开因子,进而对向量程序进行适当的循环展开或完全展开,从而有效提升应用程序的性能。

关键词: 向量程序,循环展开,展开因子,完全展开

Abstract: Loop unrolling is a loop transformation that has been developed well and widely used in serial programs.However,it usually loses its efficiency in practical vectorized applications.To solve this problem,this paper proposed a loop unrolling technique for vectorized loops.First,we presented the CUFVL algorithm to automatically determine the unroll factors via taking the register pressure and code explosion problems into account.Second,we designed a complete unrolling strategy according to the specific characters of vectorized loops.Finally,we showed the flow chart of the proposed technique based on the CUFVL algorithm and the complete unrolling strategy.Experimental results show that the proposed technique can find out the most suitable unroll factor for the programs used in the experiments,thereby enhancing their performance efficiently.

Key words: Vectorized programs,Loop unrolling,Unroll factor,Complete unrolling

[1] Sarkar V.Optimized unrolling of nested loops[C]∥Proc.of the 14th Int’l Conf.on Supercomputing.New Mexico:ACM Press,2000
[2] Li W L,Liu L,Tang Z Z.Loop unrolling optimization for software pipelining[J].Journal of Beijing University of Aeronautics and Astronautics,2004,30(11):1111-1115(in Chinese)李文龙,刘利,汤志忠.软件流水中的循环展开优化[J].北京航空航天大学学报,2004,30(11):1111-1115
[3] Lam Y M,Coutinho J G F,Luk W,et al.Unrolling-based loop mapping and scheduling[C]∥2008 International Conference on Field-Programmable Technolugy(ICFPT’2008).Taipei,Taiwan,2008:321-324
[4] Callahan D,Kennedy K,Porterfield A.Software prefetching[C]∥Proc.of the 4th Int’l Conf.on Architectural Support for Programming Languages and Operating Systems.ACM Press,1991
[5] Zhu J H.Research on SIMD Compiling Optimization Techniques[D].Shanghai:Fudan University,2005(in Chinese)朱嘉华.SIMD编译优化技术研究[D].上海:复旦大学,2005
[6] Jiang W H,Mei C,Guo Y.Vectorization for Real-life MultimediaApplications on Processors’ Multimedia Extensions [J].Chinese Journal of Computers,2005,28(8):1254-1266(in Chinese) 姜伟华,梅超,郭一.一种针对多媒体扩展指令集和实际多媒体程序的自动向量化方法[J].计算机学报,2005,28(8):1254-1266
[7] Zhang W H,Zang B Y.Research on SIMD Compiling Optimization Techniques [J].Communications of the China Computer Federation(CCCF),2007,3(2):27-36(in Chinese)张为华,臧斌宇.SIMD编译优化技术研究概述[J].中国计算机学会通讯,2007,3(2):27-36
[8] Intel Corp.Intel C/C++ and Intel Fortran Compilers for Linux[EB/OL].http://www.intel.com/software/products/compilers
[9] The GNU Compiler Collection[EB/OL].http://gcc.gnu.org
[10] Pathscale Compiler User’s Guide[EB/OL].http://www.pathscale.com
[11] http://sourceforge.net/projects/open64/files/open64/Open64-5.0
[12] Bacon D F,Graham S L,Shap O J.Compiler Transformations for High-Performance Computing[J].ACM Computing Surveys,1994,26(4):345-420
[13] Alexander M J,Bailey M W,Childers B R,et al.Memory bandwidth optimizations for wide-bus machines[C]∥Proceedings of the 26th Hawaii International Conference on System Sciences.Wailea,Hawaii,1993:466-475
[14] Mowry T C.Tolerating Latency Through Software-ControlledData Prefetching[D].Stanford University,March 1994
[15] Fog A.Optimizing subroutines in assembly language[D].Co-penhagen University College of Engineering,September 2012
[16] Ma Y,Carr S.Register Pressure Guided Unroll-and-Jam[M]∥The 2008 Open64 Workshop.Boston,MA,USA,April 6,2008
[17] Carr S,Guan Y.Unroll and Jam using Uniformly Generated Sets[C]∥Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30).1997:349-357
[18] Carr S,Kennedy K.Improving the Ratio of Memory Operations to Floating-Point Operations in Loops[J].ACM Transactions on Programming Languages and Systems (ToPLaS),1994,16(6):1768-1810
[19] Mark S,Saman A.Predicting Unroll Factors Using SupervisedClassification[C]∥Proceedings of the International Symposium on Code Generation and Optimization(CGO).2003:204-215
[20] Monsifrot A,Bodin F,Quiniou R.A Machine Learning Ap-proach to Automatic Production of Compiler Heuristics[M]∥Artificial Intelligence:Methodology,Systems,Applications.2002:41-50
[21] Mowry T C,Lam M S,Gupta A.Design and evaluation of a compiler algorithm for prefetching[C]∥Proceeding of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems.Massachusetts:ACM Press,1992:62-73

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!