Computer Science (计算机科学), 2025, Vol. 52, Issue 6: 66-73. doi: 10.11896/jsjkx.240700009
姜军, 顾晓阳, 徐坤坤, 吕勇帅, 黄亮明
JIANG Jun, GU Xiaoyang, XU Kunkun, LYU Yongshuai, HUANG Liangming
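Abstract: On domestic Sunway (申威) processors, the Sunway GCC compiler has difficulty vectorizing certain complex programs with either automatic vectorization or inline assembly, which limits the performance the processors can deliver. To address the programs that cannot be vectorized, this work designs and studies a SIMD programming interface in the Sunway GCC compiler. Building on the Sunway vector instructions, vector machine modes and vector data types are added to the compiler so that it can recognize vector parameter types. Depending on the type and complexity of each vector instruction, the SIMD interface functions are implemented in one of three ways: built-in function extension, operator extension, or high-level language extension. Different instruction templates are added in the back end so that each interface function matches its corresponding template and generates assembly code for the corresponding vector instruction. Tests and analysis on the FFTW and Hyperscan libraries show that, compared with the programs before optimization, vectorization through the SIMD programming interface gives average speedups of 1.97 and 2.13 for the Double-type and Float-type FFTW programs, respectively, and an average speedup of 2.94 for Hyperscan.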
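The "operator extension" style mentioned in the abstract can be illustrated with the generic vector extension available in upstream GCC. The sketch below is not the Sunway interface: the type name `v4sf` and the helpers `mul_add` and `lane_sum` are hypothetical names chosen for illustration, and the 16-byte vector width is an assumption made so the code compiles on common targets. It only shows how a vector data type lets ordinary C operators act on all lanes at once, leaving instruction selection to the compiler back end.

```c
/* Generic GCC vector extension: four floats packed into one 16-byte vector.
   The name v4sf is hypothetical and is not taken from the Sunway interface. */
typedef float v4sf __attribute__((vector_size(16)));

/* Operator-style SIMD: computes a * b + c on all four lanes.
   The compiler maps the operators to whatever vector instructions
   the target provides, or to scalar code if it has none. */
static v4sf mul_add(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;
}

/* Adds the four lanes of a vector using GCC's vector subscripting. */
static float lane_sum(v4sf v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

A caller would initialize vectors with brace syntax, for example `v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};`, and pass them by value. A built-in function extension would instead expose each vector instruction through a dedicated function, which suits instructions that have no natural C operator.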