计算机科学 ›› 2020, Vol. 47 ›› Issue (8): 5-16. doi: 10.11896/jsjkx.200600045
阳王东, 王昊天, 张宇峰, 林圣乐, 蔡沁耘
YANG Wang-dong, WANG Hao-tian, ZHANG Yu-feng, LIN Sheng-le, CAI Qin-yun
Abstract: With the rapid growth in computing power demanded by applications such as artificial intelligence and big data, and the increasing diversity of application scenarios, heterogeneous hybrid parallel computing has become a major focus of research. This paper introduces the main heterogeneous computer architectures in use today, including CPU/coprocessor, CPU/many-core processor, CPU/ASIC, and CPU/FPGA systems. It then outlines how heterogeneous hybrid parallel programming models have changed as these architectures evolved: a model may be a redesign and re-implementation of an existing language, an extension of an existing heterogeneous programming language, directive-based heterogeneous programming, or container-based cooperative programming. The analysis shows that heterogeneous hybrid parallel architectures will further strengthen their support for AI while also improving the generality of software. The paper also reviews the key techniques of heterogeneous hybrid parallel computing, including parallel task partitioning, task mapping, data communication, and data access across heterogeneous processors, as well as parallel synchronization for heterogeneous cooperation and pipelined parallelism across heterogeneous resources. Based on these techniques, it identifies the challenges facing heterogeneous hybrid parallel computing, such as difficult programming, difficult porting, high data-communication overhead, complex data access, complex parallel control, and unbalanced resource loads. Finally, it analyzes these challenges and points out that breakthroughs in the core technologies are needed in the integration of general-purpose and AI-specific heterogeneous computing, seamless porting across heterogeneous architectures, unified programming models, integrated memory and computation (compute-in-memory), and intelligent task partitioning and assignment.
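To make the directive-based heterogeneous programming and CPU/GPU task partitioning mentioned in the abstract concrete, the following C/OpenMP sketch statically splits one loop between an offload device and the host CPU cores. It is a minimal illustration under assumed conditions (an OpenMP 4.5-capable compiler, a single offload device, an arbitrary 70/30 split ratio) rather than code taken from any of the surveyed works.

/*
 * Minimal sketch (illustrative, not from the surveyed works):
 * directive-based offload plus a static CPU/GPU task split.
 * Assumes an OpenMP 4.5-capable compiler; if no device is present,
 * the target region falls back to host execution.
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N     (1 << 20)
#define SPLIT 0.7            /* fraction of iterations offloaded; illustrative only */

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    int cut = (int)(N * SPLIT);   /* [0, cut) -> device, [cut, N) -> host CPU */

    #pragma omp parallel
    #pragma omp single
    {
        /* Device chunk: an asynchronous target task; the map clauses make
         * the host<->device data communication explicit. */
        #pragma omp target teams distribute parallel for nowait \
                map(to: x[0:cut]) map(tofrom: y[0:cut])
        for (int i = 0; i < cut; i++)
            y[i] += 2.0 * x[i];

        /* Host chunk: spread over the remaining CPU threads as tasks. */
        #pragma omp taskloop
        for (int i = cut; i < N; i++)
            y[i] += 2.0 * x[i];

        /* Wait for the offloaded chunk before leaving the region. */
        #pragma omp taskwait
    }

    printf("y[0]=%.1f  y[N-1]=%.1f (both should be 4.0)\n", y[0], y[N - 1]);
    free(x); free(y);
    return 0;
}

In a real hybrid code the split ratio would be derived from performance models of the two processors rather than fixed, and the map clauses correspond directly to the host-device data communication that the abstract lists as a major source of overhead.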
YANG Wang-dong, PhD, professor. His main research interests include high performance computing and parallel computing.