Computer Science, 2017, Vol. 44, Issue (12): 1-10. doi: 10.11896/j.issn.1002-137X.2017.12.001
TANG Tao, PENG Lin, HUANG Chun and YANG Can-qun