Computer Science, 2017, Vol. 44, Issue (12): 1-10. doi: 10.11896/j.issn.1002-137X.2017.12.001

Performance Analysis of GPU Programs Towards Better Memory Hierarchy Design

TANG Tao, PENG Lin, HUANG Chun and YANG Can-qun   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: With higher peak performance and energy efficiency than CPUs, as well as an increasingly mature software environment, GPUs have become one of the most popular accelerators for building heterogeneous parallel computing systems. A GPU generally hides memory access latency through a flexible, lightweight thread-switching mechanism, but its massive parallelism puts the memory system under severe pressure, and its actual performance depends heavily on the efficiency of memory access operations. The analysis and optimization of GPU programs' memory access behavior have therefore long been active research topics in GPU-related studies. However, few existing works have analyzed the impact of memory hierarchy design on performance from an architectural point of view. To better guide both the design of GPU memory hierarchies and program optimization, this paper experimentally analyzes in detail how each level of the GPU memory hierarchy affects program performance, and summarizes several strategies for the memory hierarchy design of future GPU-like architectures and for program optimization.

Key words: Heterogeneous system, GPU, Memory hierarchy, Performance analysis, Optimization

