计算机科学 ›› 2022, Vol. 49 ›› Issue (6A): 777-783.doi: 10.11896/jsjkx.210400146

• 交叉&应用 • 上一篇    下一篇

向量DSP的混合资源启发式循环展开因子选择方法研究

陆浩松, 胡勇华, 王书盈, 周新莲, 李慧祥   

  1. 湖南科技大学计算机科学与工程学院 湖南 湘潭 411201
  • 出版日期:2022-06-10 发布日期:2022-06-08
  • 通讯作者: 胡勇华(yonghuahu@hnust.edu.cn)
  • 作者简介:(webdfbxg@163.com)
  • 基金资助:
    湖南省教育厅科研项目(20B242,19A169);湖南省自然科学基金(2017JJ3087);国家自然科学基金(61872138)

Study on Hybrid Resource Heuristic Loop Unrolling Factor Selection Method Based on Vector DSP

LU Hao-song, HU Yong-hua, WANG Shu-ying, ZHOU Xin-lian, LI Hui-xiang   

  1. School of Computer Science and Engineering,Hunan University of Science and Technology,Xiangtan,Hunan 411201,China
  • Online:2022-06-10 Published:2022-06-08
  • About author:LU Hao-song,born in 1995,postgra-duate.His main research interests include DSP compilation and code optimization technology.
    HU Yong-hua,born in 1981,Ph.D,professor,Ph.D supervisor.His main research interests include DSP compilation and code optimization technology.
  • Supported by:
    Research Projects of Hunan Provincial Department of Education(20B242,19A169),Natural Science Foundation of Hunan Province,China(2017JJ3087) and National Natural Science Foundation of China(61872138).

摘要: 在现代处理器中,具有向量处理单元的VLIW体系结构逐渐成为高性能DSP体系结构的典型代表。基于这类体系结构的寄存器资源丰富、执行单元多等特点,研究了相应的循环展开因子选择问题,提出了一种循环展开因子选择方法来提升循环展开这种重要优化的效果。该方法考虑了循环体代码的向量标量属性、基址寄存器和索引寄存器资源使用规则等因素,并且在展开因子选择算法中增加了执行单元使用占比和展开因子按幂次对齐这两种启发式因素。针对3种常用数字信息处理算法开展了实验研究,实验结果表明了该方法的有效性。对于这三种DSP算法,用所提方法获得的循环展开因子进行循环展开处理后,它们的平均性能相比已有方法提升了10%以上。

关键词: 编译优化, 超长指令字, 向量DSP, 循环展开, 展开因子

Abstract: For modern microprocessors,the very long instruction word(VLIW) architecture integrating vector processing units has gradually become a typical representative of high-performance digital signal processor(DSP) architectures.This architecture is mainly characterized by rich register resources and many instruction execution units.Based on these characteristics,a selection method for the corresponding loop unrolling factor is proposed to improve the effect of loop unrolling optimization.This method takes into account the vector or scalar attribute of the code in a loop body,and the usage rules of base address registers and index registers.Moreover,another two heuristics,i.e.,the proportion of the times that the execution units are used and the power alignment of unrolling factor,are used in the loop unrolling factor selection algorithm.The ability of this method in developing more instruction level parallelism is proved by experiments performed on three commonly used digital signal processing algorithms.Experiment results show that the average performance of the algorithms improves by more than 10% compared with the existing methods.In particular,experiments on FFT algorithm show that the proposed method can analyze the usage of related hardware resources more accurately through the hybrid resource heuristics,and makes the judgment of unrolling and obtains the corresponding value of loop unrolling factor.

Key words: Compiler optimization, Loop unrolling, Unrolling factor, Vector DSP, VLIW

中图分类号: 

  • TP314
[1] LEE Y,AVIZIENIS R,BISHARA A,et al.Exploring thetradeoffs between programmability and efficiency in data-parallel accelerators[J].ACM Sigarch Computer Architecture News,2011,39(3):129-140.
[2] LIN C C,GU N J,LEI Y M,et al.SIMD compiler optimization of clustered VLIW DSP[J].Journal of University of Science and Technology of China,2011(8):53-59.
[3] HE G Q,CHEN Y.Research on Huarui DSP software architecture[J].Modern Radar,2016(9):17-22.
[4] BLAKE G,DRESLINSKI R G,MUDGE T.A survey of mul-ticore processors[J].IEEE Signal Processing Magazine,2009,26(6):26-37.
[5] WOH M,SEO S,MAHLKE S A,et al.AnySP:Anytime Anywhere Anyway Signal Processing[C]//36th International Symposium on Computer Architecture(ISCA 2009).Austin,TX,USA.ACM,2009:20-24.
[6] ROWEN C,DAN N,RAVINDRAN R,et al.The world's fastest DSP core:Breaking the 100 GMAC/s barrier[C]//2011 IEEE Hot Chips 23 Symposium(HCS).IEEE,2011.
[7] CHEN S M,LIU S,WAN J H,et al.Architecture and implementation of collaborative multi-core DSP YHFT qmbase[J].Chinese Science:Information Science,2015,45(4):560-573.
[8] ZHOU N,WANG R,QIN Y Y,et al.Design and implementation of inter core communication mechanism based on heterogeneous multi-core environment[J].Computer Engineering and Design,2019,40(3):294-300,308.
[9] PADUA D A.Advanced compiler optimizations for supercomputers[J].Communications of the ACM,1986,29(12):1184-1201.
[10] SHIVAM A,WATKINSON N,NICOLAU A,et al.Towards an Achievable Performance for the Loop Nests[J].arXiv:1902.00603,2019.
[11] CUI Y Z,LIU S,WANG Q,et al.Study on cyclic optimization technique of lattice Boltzmann method[J].Acta Computa Sinica,2020,450(6):116-132.
[12] WEISS S,SMITH J E.A study of scalar compilation techniques for pipelined supercomputers[J].ACM Sigarch Computer Architecture News,1990,15(5):105-109.
[13] SARKAR V.Optimized Unrolling of Nested Loops[J].International Journal of Parallel Programming,2001,29(5):545-581.
[14] CARR S,GUAN Y.Unroll-and-jam using uniformly generated sets[C]//Proceedings of 30th Annual International Symposium on Microarchitecture.IEEE,1997:349-357.
[15] CARR S M, KENNEDY K W.Improving the ratio of memory operations to floating-point operations in loops[J].ACM Tran-sactions on Programming Languages & Systems,1994,16(6):1768-1810.
[16] LI W L,LIU L,TANG Z Z.Optimization of loop unfolding in software pipeline[J].Journal of Beijing University of Aeronautics and Astronautics,2004,30(11):1111-1115.
[17] HUANG Y B,LI C J.Implementation of tail loop Vectorization Based on llvm[C]//Proceedings of the 20th Annual Conference of Computer Engineering and Technology and the 6th Microprocessor Technology Forum.Computer Society of China,2016.
[1] 唐镇, 胡勇华, 陆浩松, 王书盈.
基于弱约束指派的DSP寄存器偶对分配算法研究
Research on DSP Register Pairs Allocation Algorithm with Weak Assigning Constraints
计算机科学, 2021, 48(6A): 587-595. https://doi.org/10.11896/jsjkx.200600061
[2] 胡伟方, 陈云, 李颖颖, 商建东.
基于数据重用分析的多面体循环合并策略
Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation
计算机科学, 2021, 48(12): 49-58. https://doi.org/10.11896/jsjkx.210200071
[3] 池昊宇, 陈长波.
基于神经网络的循环分块大小预测
Prediction of Loop Tiling Size Based on Neural Network
计算机科学, 2020, 47(8): 62-70. https://doi.org/10.11896/jsjkx.191200180
[4] 高伟,赵荣彩,于海宁,张庆花.
循环展开技术在向量程序中的应用
Loop Unrolling in Vectorized Programs
计算机科学, 2016, 43(1): 226-231. https://doi.org/10.11896/j.issn.1002-137X.2016.01.049
[5] 刘鹏,赵荣彩,赵博,高伟.
一种面向SIMD扩展部件的向量化统一架构
Unified Vectorization Framework for SIMD Extensions
计算机科学, 2014, 41(9): 28-31. https://doi.org/10.11896/j.issn.1002-137X.2014.09.004
[6] 葛红美,徐超,陈念,廖希密.
一种面向嵌入式系统总线的低功耗优化方法
Low Power Optimization Method Oriented to Embedded System’s Bus
计算机科学, 2013, 40(12): 31-36.
[7] 田祖伟,孙光.
基于谓词代码的编译优化技术研究
Research of Compiler Optimization Technology Based on Predicated Code
计算机科学, 2010, 37(5): 130-133.
[8] .
基于汇编代码的指令调度器的设计与实现

计算机科学, 2009, 36(3): 45-47.
[9] .
数据流分析的关键技术研究

计算机科学, 2005, 32(12): 91-93.
[10] 连瑞琦 张兆庆.
低功耗编译的若干相关技术

计算机科学, 2004, 31(8): 164-167.
[11] 舒辉 康绯.
CME分析中的丢番图方程求解

计算机科学, 2002, 29(7): 149-151.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!