Computer Science, 2025, Vol. 52, Issue (9): 186-194. doi: 10.11896/jsjkx.241100130

• High Performance Computing •

SLP Vectorization Across Basic Blocks Based on Region Partitioning

HAN Lin1,2, DING Yongqiang1, CUI Pingfei1, LIU Haohao2, LI Haoran2, CHEN Mengyao2

  1 College of Cyber Security, Zhongyuan University of Technology, Zhengzhou 451191, China
    2 National Supercomputing Center in Zhengzhou, Zhengzhou 450001, China
  • Received: 2024-11-25  Revised: 2025-04-15  Online: 2025-09-15  Published: 2025-09-11
  • Corresponding author: CUI Pingfei (cpf1975@126.com)
  • About author: (strollerlin@163.com)
  • Supported by:
    2024 Henan Provincial Science and Technology Tackling Project (242102211094) and 2022 Major Science and Technology Program in Henan Province 17 (221100210600).

SLP Vectorization Across Basic Blocks Based on Region Partitioning

HAN Lin1,2, DING Yongqiang1, CUI Pingfei1, LIU Haohao2, LI Haoran2, CHEN Mengyao2   

  1 College of Cyber Security, Zhongyuan University of Technology, Zhengzhou 451191, China
    2 National Supercomputing Center in Zhengzhou, Zhengzhou 450001, China
  • Received: 2024-11-25  Revised: 2025-04-15  Online: 2025-09-15  Published: 2025-09-11
  • About author: HAN Lin, born in 1978, professor, doctoral supervisor, is a member of CCF (No. 16416M). His main research interests include high-performance computing, advanced compilation, program optimization and home-grown autonomous control.
    CUI Pingfei, born in 1975, associate professor, master supervisor. His main research interests include domestic sovereign control, software reverse engineering and code security analysis.
  • Supported by:
    2024 Henan Provincial Science and Technology Tackling Project (242102211094) and 2022 Major Science and Technology Program in Henan Province 17 (221100210600).

Abstract: Automatic vectorization, as an important means of exploiting data-level parallelism and improving program performance, is widely used in mainstream compilers. Superword-level parallelism (SLP) vectorization focuses on detecting data parallelism among adjacent isomorphic statements and packing scalar instructions into vector instructions. However, the traditional SLP framework is limited when vectorizing statements across basic blocks; in particular, when consecutive vectorizable instructions are split by basic block boundaries, SLP analysis cannot effectively discover the potentially vectorizable statements. To address this problem, a cross-basic-block SLP vectorization method based on region partitioning is proposed. By extending the analysis scope to multiple basic blocks within dominance relations, the method breaks the restriction of basic block boundaries, captures more potential vectorization opportunities, and effectively improves the efficiency of SLP vectorization. The proposed method is implemented in the GCC 10.3.0 compiler and evaluated on test programs selected from the SPEC CPU2006 benchmark suite that contain the relevant code patterns. Experimental results show that, compared with the traditional SLP method, the proposed method improves the speedup of the selected SPEC CPU2006 programs by up to 12%, with an average speedup improvement of 8% on the related test programs, and achieves an average speedup of 3% on the Polybench benchmark, verifying its effectiveness. This work provides a technical reference for improving the efficiency of SLP vectorization in GCC.
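To make the motivating case concrete, the following C fragment is a hypothetical example (not taken from the paper): the four assignments to out[0..3] are isomorphic and independent of the branch, yet the if statement places them in different basic blocks, so a basic-block-local SLP pass sees only two candidate statements per block.

/* Hypothetical motivating example: the guard on "flag" introduces a
 * basic-block boundary between out[1] and out[2], although all four
 * assignments are isomorphic and independent of the branch. */
void kernel(float *restrict out, const float *restrict a,
            const float *restrict b, int flag, int *count)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    if (flag)
        (*count)++;          /* unrelated side effect in its own block */
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}

Block-local SLP can at best pack out[0]/out[1] and out[2]/out[3] separately; analyzing the blocks related by dominance as a single region allows all four statements to be considered for one full-width pack.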

Key words: Compilation optimization, Automatic vectorization, SLP, Across basic blocks, Region partitioning

Abstract: Automatic vectorization is a key technique in mainstream compilers for uncovering data-level parallelism and enhancing program performance. Traditional SLP vectorization struggles with cross-basic-block statement vectorization, particularly when consecutive vectorizable instructions are split by basic block boundaries, limiting its ability to detect potential vectorization opportunities. To address this, this paper proposes a region-based cross-basic-block SLP vectorization method that extends the analysis scope to multiple basic blocks within dominance relations, effectively breaking basic block boundaries and uncovering more vectorization opportunities. Implemented in the GCC 10.3.0 compiler, the proposed method is evaluated using relevant program segments from the SPEC CPU2006 benchmark. Experimental results demonstrate that the proposed method achieves up to a 12% speedup on SPEC CPU2006, an average speedup of 8% on the related test programs, and a 3% average speedup on the Polybench benchmark compared with traditional SLP methods, validating its effectiveness. This work provides a technical reference for improving SLP vectorization efficiency in GCC.
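As a minimal, self-contained C sketch (assuming a toy CFG with precomputed immediate dominators; this is an illustration, not the paper's actual GCC implementation), the program below groups every block with the blocks it dominates into one analysis region, which is the scope over which a cross-basic-block SLP pass could collect pack seeds.

#include <stdio.h>

#define NBLOCKS 4

/* Toy CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3 (an if/else diamond).
 * idom[i] is the immediate dominator of block i; the entry block 0
 * is its own immediate dominator. */
static const int idom[NBLOCKS] = { 0, 0, 0, 0 };

/* Returns 1 if block a dominates block b (walk b's idom chain). */
static int dominates(int a, int b)
{
    while (b != a && b != idom[b])
        b = idom[b];
    return b == a;
}

int main(void)
{
    /* For each candidate region root, list the blocks whose statements
     * would be scanned together when looking for isomorphic packs. */
    for (int root = 0; root < NBLOCKS; root++) {
        printf("region rooted at block %d:", root);
        for (int b = 0; b < NBLOCKS; b++)
            if (dominates(root, b))
                printf(" %d", b);
        printf("\n");
    }
    return 0;
}

For this diamond CFG the region rooted at block 0 covers blocks 0, 1, 2 and 3, so statements that block boundaries would otherwise separate are examined together; the other regions degenerate to single blocks.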

Key words: Compilation optimization, Automatic vectorization, SLP, Across basic blocks, Region partitioning

CLC Number: TP314