计算机科学 ›› 2022, Vol. 49 ›› Issue (11): 76-82.doi: 10.11896/jsjkx.211200252
高秀武, 黄亮明, 姜军
GAO Xiu-wu, HUANG Liang-ming, JIANG Jun
摘要: 针对流式存储访问引起的缓存污染与强制性缺失问题,部分高性能通用处理器平台提供了不经过缓存而直接访问存储器的专用通路及配套指令支持。在常见的流式存储应用场景中,合理采用直访主存方式可以提高芯片存储器系统的整体性能。然而,判断何时使用直访主存能够获得收益对于程序员来说是一项十分繁琐且容易出错的任务,一种行之有效的方法是通过编译器自动实现。因此,文中在深入分析流式存储访问模式使用不同类型访存操作性能收益的基础上,提出了基于GCC编译器的流式存储优化方法。该方法由编译器自动实现对程序员透明,在GCC编译器SSA-GIMPLE阶段对程序循环中具有流式访问特征的连续写或者跨步写进行识别,并根据收益分析与依赖关系筛选优化对象,最后在编译器后端匹配指令模板生成直访主存指令。使用连续/跨步写用例与STREAM测试集及变体在申威国产处理器平台上进行实验评估,结果表明,文中提出的优化方法能够显著缩短流式存储应用程序的执行时间,优化后STREAM测试集的平均加速比为1.31。另外,文中实现的流式存储优化与循环展开优化一起使用效果更好,STREAM测试集的平均加速比能达到1.45。
中图分类号:
[1]WULF W A,MCKEE S A.Hitting the memory wall:implications of the obvious [J].ACM Sigarch Computer Architecture News,1995,23(1):20-24. [2]NOWATZYK A,PONG F,SAULSBURY A.Missing the Memory Wall:The Case for Processor/Memory Integration [J].ACM Sigarch Computer Architecture News,1996,24(2):90-101. [3]DENNING P J.The Locality Principle [J].Communications of the ACM,2005,48(7):19-24. [4]PING L.Analysis and Development of the Locality Principle[J].Advances in Intelligent and Soft Computing,2012,133(7):211-214. [5]BRYANT R,O’HALLARON D.Computer systems:a pro-grammer’s perspective [M].Upper Saddle River:Prentice Hall,2003. [6]VENKATESAN R,KOZHIKKOTTU V J,SHARAD M,et al.Cache Design with Domain Wall Memory[J].IEEE Transactions on Computers,2016,65(4):1010-1024. [7]BAER J L,CHEN T F.An effective on-chip preloading scheme to reduce data access penalty [C]//IEEE Conference on Supercomputing.ACM,1991. [8]DONG Y S,LI C J.Mechanism and Capability of Data Prefet-ching in Intel©64 Architecture [J].Computer Science,2016,43(5):34-41. [9]TIMOTHY S A,JONES M.Software Prefetching for IndirectMemory Accesses[C]//2017 IEEE/ACM International Symposium on Code Generation and Optimization(CGO).ACM,2017. [10]WANG J H,LI J,LU D D,et al.Hardware prefetching mechanism based on double step data stream[J].Computer Enginee-ring,2019,45(6):115-118,126. [11]JALEEL A,THEOBALD K B,STEELY S C,et al.High performance cache replacement using re-reference interval prediction(RRIP)[C]//International Symposium on Computer Architecture.ACM,2010. [12]ZHUANG X T,LEE H.A hardware-based cache pollution filtering mechanism for aggressive prefetches[C]//2003 International Conference on Parallel Processing.IEEE,2003. [13]PALANCA S,PENTKOVSKI V,TSAI S,et al.Method and apparatus for implementing Nontemporal stores.U.S.Patent 6,205,520[P].2001. [14]SANDBERG A,EKLOV D,HAGERSTEN E.Reducing cache pollution through detection and elimination of Nontemporal memory accesses[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society,2010:1-11. [15]Intel.Intel©64 and IA-32 Architectures Software Developer’s Manuals,Volume 2B:Instruction Set Reference [Z].September 2016. [16]ARM.ARM Architecture Reference Manual,ARMv8,for ARM-v8-Aarchitecture profile [Z].September 2016. [17]KRISHNAIYER R,KULTURSAY E,CHAWLA P,et al.Compiler-Based Data Prefetching and Streaming Nontemporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor[C]//2013 IEEE International Symposium on Parallel & Distributed Processing,Workshops and Phd Forum.IEEE,2013:1576-1586. [18]Intel© C++ Compiler Classic Developer Guide and Reference [Z].Version 2021.1,December 2020. [19]Free Software Foundation,Inc.GCC,the GNU compiler collection [EB/OL].(2017-05-02).https://gcc.gnu.org/. [20]MILLER D W,III D.Performance analysis of disk cache write policies [J].Microprocessors& Micro-systems,1995,19(3):121-130. [21]SPEC CPU2006 [EB/OL].(2011-10-20).https://www.spec.org/cpu2006/Docs. [22]SPEC CPU2017 [EB/OL].(2021-04-07).https://www.spec.org/cpu2017/Docs. [23]MOWRY T C,LAM M S,GUPTA A.Design and Evaluation of a Compiler Algorithm for Prefetching[J/OL].Aplos,1992.https://dl.acm.org/doi/epdf/10.1145/143365.143488. |
[1] | 陆浩松, 胡勇华, 王书盈, 周新莲, 李慧祥. 向量DSP的混合资源启发式循环展开因子选择方法研究 Study on Hybrid Resource Heuristic Loop Unrolling Factor Selection Method Based on Vector DSP 计算机科学, 2022, 49(6A): 777-783. https://doi.org/10.11896/jsjkx.210400146 |
[2] | 王博漾, 庞建民, 徐金龙, 赵捷, 陶小涵, 朱雨. 基于多面体模型的矩阵乘法向量代码生成 Matrix Multiplication Vector Code Generation Based on Polyhedron Model 计算机科学, 2022, 49(10): 44-51. https://doi.org/10.11896/jsjkx.210800247 |
[3] | 唐镇, 胡勇华, 陆浩松, 王书盈. 基于弱约束指派的DSP寄存器偶对分配算法研究 Research on DSP Register Pairs Allocation Algorithm with Weak Assigning Constraints 计算机科学, 2021, 48(6A): 587-595. https://doi.org/10.11896/jsjkx.200600061 |
[4] | 胡伟方, 陈云, 李颖颖, 商建东. 基于数据重用分析的多面体循环合并策略 Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation 计算机科学, 2021, 48(12): 49-58. https://doi.org/10.11896/jsjkx.210200071 |
[5] | 杨萍, 王生原. CompCert编译器目标代码生成机制分析 Analysis of Target Code Generation Mechanism of CompCert Compiler 计算机科学, 2020, 47(9): 17-23. https://doi.org/10.11896/jsjkx.200400018 |
[6] | 池昊宇, 陈长波. 基于神经网络的循环分块大小预测 Prediction of Loop Tiling Size Based on Neural Network 计算机科学, 2020, 47(8): 62-70. https://doi.org/10.11896/jsjkx.191200180 |
[7] | 丁嵘, 于千惠. 基于AADL的自主无人系统可成长框架 Growth Framework of Autonomous Unmanned Systems Based on AADL 计算机科学, 2020, 47(12): 87-92. https://doi.org/10.11896/jsjkx.201100173 |
[8] | 李朋远,赵荣彩,高 伟,张庆花. 一种支持跨幅访存的向量化代码生成方法 Effective Vectorization Technique for Interleaved Data with Constant Strides 计算机科学, 2015, 42(5): 194-199. https://doi.org/10.11896/j.issn.1002-137X.2015.05.039 |
[9] | 葛红美,徐超,陈念,廖希密. 一种面向嵌入式系统总线的低功耗优化方法 Low Power Optimization Method Oriented to Embedded System’s Bus 计算机科学, 2013, 40(12): 31-36. |
[10] | 田祖伟,孙光. 基于谓词代码的编译优化技术研究 Research of Compiler Optimization Technology Based on Predicated Code 计算机科学, 2010, 37(5): 130-133. |
[11] | 邓蓉,陈闳中,李灿,王小明,李捷,张军旗. GridSim仿真代码自动生成器GridsimHelper GridsimHelper:A Code Generator for GridSim Platform 计算机科学, 2010, 37(10): 135-137. |
[12] | . 基于汇编代码的指令调度器的设计与实现 计算机科学, 2009, 36(3): 45-47. |
[13] | 张立勇 陈平. 基于代码生成的Web信息系统工程化开发方法 计算机科学, 2008, 35(5): 284-287. |
[14] | . 数据流分析的关键技术研究 计算机科学, 2005, 32(12): 91-93. |
[15] | 戴清涵 李宣东 赵建华 郑国梁. 面向设计流图的代码支撑工具 计算机科学, 2005, 32(11): 203-206. |
|