基于GCC编译器的流式存储优化方法

doi:10.11896/jsjkx.211200252

计算机科学 ›› 2022, Vol. 49 ›› Issue (11): 76-82.doi: 10.11896/jsjkx.211200252

基于GCC编译器的流式存储优化方法

高秀武, 黄亮明, 姜军

江南计算技术研究所江苏无锡 214083

收稿日期:2021-12-22 修回日期:2022-04-29 出版日期:2022-11-15 发布日期:2022-11-03
通讯作者: 黄亮明(liangming_huang@126.com)
作者简介:(wxgcs@163.com)
基金资助:
国家重点研发计划(2020YFB0204602);综合研究项目(针对申威处理器的编译优化提升技术)

Optimization Method of Streaming Storage Based on GCC Compiler

GAO Xiu-wu, HUANG Liang-ming, JIANG Jun

Jiang Institute of Computing Technology,Wuxi,Jiangsu 214083,China

Received:2021-12-22 Revised:2022-04-29 Online:2022-11-15 Published:2022-11-03
About author:GAO Xiu-wu,born in 1992,postgra-duate.His main research interests include architecture-oriented performance analysis and optimization,compiler optimization,etc.
HUANG Liang-ming,born in 1988,Ph.D,assistant professor,is a member of China Computer Federation.His main research interests include architecture-oriented performance analysis and optimization,compiler optimization,etc.
Supported by:
National Key Research and Development Project(2020YFB0204602) and Comprehensive Research Project(Research on Compilation Optimization Improvement for Sunway Processor).

摘要/Abstract

摘要： 针对流式存储访问引起的缓存污染与强制性缺失问题,部分高性能通用处理器平台提供了不经过缓存而直接访问存储器的专用通路及配套指令支持。在常见的流式存储应用场景中,合理采用直访主存方式可以提高芯片存储器系统的整体性能。然而,判断何时使用直访主存能够获得收益对于程序员来说是一项十分繁琐且容易出错的任务,一种行之有效的方法是通过编译器自动实现。因此,文中在深入分析流式存储访问模式使用不同类型访存操作性能收益的基础上,提出了基于GCC编译器的流式存储优化方法。该方法由编译器自动实现对程序员透明,在GCC编译器SSA-GIMPLE阶段对程序循环中具有流式访问特征的连续写或者跨步写进行识别,并根据收益分析与依赖关系筛选优化对象,最后在编译器后端匹配指令模板生成直访主存指令。使用连续/跨步写用例与STREAM测试集及变体在申威国产处理器平台上进行实验评估,结果表明,文中提出的优化方法能够显著缩短流式存储应用程序的执行时间,优化后STREAM测试集的平均加速比为1.31。另外,文中实现的流式存储优化与循环展开优化一起使用效果更好,STREAM测试集的平均加速比能达到1.45。

关键词: GCC编译器, 直访主存, 编译优化, 代码生成, 国产处理器

Abstract: To solve the problem of cache pollution and mandatory loss caused by streaming memory access,some high-perfor-mance general-purpose processor platforms provide a dedicated path and supporting instructions for accessing memory directly without accessing the cache.The overall performance of chip memory system can be improved by using direct memory access in common application scenarios such as streaming storage.However,it is a tedious and error-prone task for programmers to determine when direct access to main memory is beneficial,and an effective way is to implement it automatically through the compiler.Therefore,based on the in-depth analysis of the benefits of different types of access operations under the streaming storage access mode,this paper proposes a streaming storage optimization method based on GCC compiler.In the SSA-GIMPLE stage of GCC compiler,the continuous write or step write with stream access characteristics in the program loop is recognized,and optimization objects are screened according to the benefit analysis and dependency relationship.Finally,the direct access main memory instructions are generated by matching instruction templates at the back end of compiler.The continuous/step-write case and STREAM test set and their variants are used for experimental evaluation on SW domestic processor platform.The results show that the optimized method can significantly reduce the execution time of STREAM storage applications,and the average acceleration ratio of STREAM test set after optimization is 1.31.Additionally,in conjunction with loop unwinding optimization,the STREAM test set has an average acceleration ratio of 1.45.

Key words: GCC complier, Direct memory access, Compiler optimization, Code generation, Domestic processor

中图分类号:

TP311

高秀武, 黄亮明, 姜军. 基于GCC编译器的流式存储优化方法[J]. 计算机科学, 2022, 49(11): 76-82. https://doi.org/10.11896/jsjkx.211200252

GAO Xiu-wu, HUANG Liang-ming, JIANG Jun. Optimization Method of Streaming Storage Based on GCC Compiler[J]. Computer Science, 2022, 49(11): 76-82. https://doi.org/10.11896/jsjkx.211200252

参考文献

[1]WULF W A,MCKEE S A.Hitting the memory wall:implications of the obvious [J].ACM Sigarch Computer Architecture News,1995,23(1):20-24.
[2]NOWATZYK A,PONG F,SAULSBURY A.Missing the Memory Wall:The Case for Processor/Memory Integration [J].ACM Sigarch Computer Architecture News,1996,24(2):90-101.
[3]DENNING P J.The Locality Principle [J].Communications of the ACM,2005,48(7):19-24.
[4]PING L.Analysis and Development of the Locality Principle[J].Advances in Intelligent and Soft Computing,2012,133(7):211-214.
[5]BRYANT R,O’HALLARON D.Computer systems:a pro-grammer’s perspective [M].Upper Saddle River:Prentice Hall,2003.
[6]VENKATESAN R,KOZHIKKOTTU V J,SHARAD M,et al.Cache Design with Domain Wall Memory[J].IEEE Transactions on Computers,2016,65(4):1010-1024.
[7]BAER J L,CHEN T F.An effective on-chip preloading scheme to reduce data access penalty [C]//IEEE Conference on Supercomputing.ACM,1991.
[8]DONG Y S,LI C J.Mechanism and Capability of Data Prefet-ching in Intel©64 Architecture [J].Computer Science,2016,43(5):34-41.
[9]TIMOTHY S A,JONES M.Software Prefetching for IndirectMemory Accesses[C]//2017 IEEE/ACM International Symposium on Code Generation and Optimization(CGO).ACM,2017.
[10]WANG J H,LI J,LU D D,et al.Hardware prefetching mechanism based on double step data stream[J].Computer Enginee-ring,2019,45(6):115-118,126.
[11]JALEEL A,THEOBALD K B,STEELY S C,et al.High performance cache replacement using re-reference interval prediction(RRIP)[C]//International Symposium on Computer Architecture.ACM,2010.
[12]ZHUANG X T,LEE H.A hardware-based cache pollution filtering mechanism for aggressive prefetches[C]//2003 International Conference on Parallel Processing.IEEE,2003.
[13]PALANCA S,PENTKOVSKI V,TSAI S,et al.Method and apparatus for implementing Nontemporal stores.U.S.Patent 6,205,520[P].2001.
[14]SANDBERG A,EKLOV D,HAGERSTEN E.Reducing cache pollution through detection and elimination of Nontemporal memory accesses[C]//Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society,2010:1-11.
[15]Intel.Intel©64 and IA-32 Architectures Software Developer’s Manuals,Volume 2B:Instruction Set Reference [Z].September 2016.
[16]ARM.ARM Architecture Reference Manual,ARMv8,for ARM-v8-Aarchitecture profile [Z].September 2016.
[17]KRISHNAIYER R,KULTURSAY E,CHAWLA P,et al.Compiler-Based Data Prefetching and Streaming Nontemporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor[C]//2013 IEEE International Symposium on Parallel & Distributed Processing,Workshops and Phd Forum.IEEE,2013:1576-1586.
[18]Intel© C++ Compiler Classic Developer Guide and Reference [Z].Version 2021.1,December 2020.
[19]Free Software Foundation,Inc.GCC,the GNU compiler collection [EB/OL].(2017-05-02).https://gcc.gnu.org/.
[20]MILLER D W,III D.Performance analysis of disk cache write policies [J].Microprocessors& Micro-systems,1995,19(3):121-130.
[21]SPEC CPU2006 [EB/OL].(2011-10-20).https://www.spec.org/cpu2006/Docs.
[22]SPEC CPU2017 [EB/OL].(2021-04-07).https://www.spec.org/cpu2017/Docs.
[23]MOWRY T C,LAM M S,GUPTA A.Design and Evaluation of a Compiler Algorithm for Prefetching[J/OL].Aplos,1992.https://dl.acm.org/doi/epdf/10.1145/143365.143488.

相关文章 15

[1]	陆浩松, 胡勇华, 王书盈, 周新莲, 李慧祥. 向量DSP的混合资源启发式循环展开因子选择方法研究 Study on Hybrid Resource Heuristic Loop Unrolling Factor Selection Method Based on Vector DSP 计算机科学, 2022, 49(6A): 777-783. https://doi.org/10.11896/jsjkx.210400146
[2]	王博漾, 庞建民, 徐金龙, 赵捷, 陶小涵, 朱雨. 基于多面体模型的矩阵乘法向量代码生成 Matrix Multiplication Vector Code Generation Based on Polyhedron Model 计算机科学, 2022, 49(10): 44-51. https://doi.org/10.11896/jsjkx.210800247
[3]	唐镇, 胡勇华, 陆浩松, 王书盈. 基于弱约束指派的DSP寄存器偶对分配算法研究 Research on DSP Register Pairs Allocation Algorithm with Weak Assigning Constraints 计算机科学, 2021, 48(6A): 587-595. https://doi.org/10.11896/jsjkx.200600061
[4]	胡伟方, 陈云, 李颖颖, 商建东. 基于数据重用分析的多面体循环合并策略 Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation 计算机科学, 2021, 48(12): 49-58. https://doi.org/10.11896/jsjkx.210200071
[5]	杨萍, 王生原. CompCert编译器目标代码生成机制分析 Analysis of Target Code Generation Mechanism of CompCert Compiler 计算机科学, 2020, 47(9): 17-23. https://doi.org/10.11896/jsjkx.200400018
[6]	池昊宇, 陈长波. 基于神经网络的循环分块大小预测 Prediction of Loop Tiling Size Based on Neural Network 计算机科学, 2020, 47(8): 62-70. https://doi.org/10.11896/jsjkx.191200180
[7]	丁嵘, 于千惠. 基于AADL的自主无人系统可成长框架 Growth Framework of Autonomous Unmanned Systems Based on AADL 计算机科学, 2020, 47(12): 87-92. https://doi.org/10.11896/jsjkx.201100173
[8]	李朋远,赵荣彩,高伟,张庆花. 一种支持跨幅访存的向量化代码生成方法 Effective Vectorization Technique for Interleaved Data with Constant Strides 计算机科学, 2015, 42(5): 194-199. https://doi.org/10.11896/j.issn.1002-137X.2015.05.039
[9]	葛红美,徐超,陈念,廖希密. 一种面向嵌入式系统总线的低功耗优化方法 Low Power Optimization Method Oriented to Embedded System’s Bus 计算机科学, 2013, 40(12): 31-36.
[10]	田祖伟,孙光. 基于谓词代码的编译优化技术研究 Research of Compiler Optimization Technology Based on Predicated Code 计算机科学, 2010, 37(5): 130-133.
[11]	邓蓉，陈闳中，李灿，王小明，李捷，张军旗. GridSim仿真代码自动生成器GridsimHelper GridsimHelper:A Code Generator for GridSim Platform 计算机科学, 2010, 37(10): 135-137.
[12]	. 基于汇编代码的指令调度器的设计与实现计算机科学, 2009, 36(3): 45-47.
[13]	张立勇陈平. 基于代码生成的Web信息系统工程化开发方法计算机科学, 2008, 35(5): 284-287.
[14]	. 数据流分析的关键技术研究计算机科学, 2005, 32(12): 91-93.
[15]	戴清涵李宣东赵建华郑国梁. 面向设计流图的代码支撑工具计算机科学, 2005, 32(11): 203-206.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

基于GCC编译器的流式存储优化方法

Optimization Method of Streaming Storage Based on GCC Compiler

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

Metrics

本文评价

推荐阅读 0