Computer Science (计算机科学), 2020, Vol. 47, Issue (1): 7-16. doi: 10.11896/jsjkx.181202409

• Computer Architecture •

Research on Locality Mechanisms in Parallel Programming Languages

YUAN Liang1, ZHANG Yun-quan1, BAI Xue-rui2, ZHANG Guang-ting1

  1. (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)1;
    (Key Laboratory Division, Bureau of Frontier Sciences and Education, Chinese Academy of Sciences, Beijing 100864, China)2
  • Received: 2018-12-15; Published: 2020-01-20
  • Corresponding author: ZHANG Yun-quan (zyq@ict.ac.cn)
  • Supported by:
    National Key R&D Program of China (2017YFB0202001), Strategic Priority Research Program of Chinese Academy of Sciences, Category C (XDC01040100), National Natural Science Foundation of China (61432018, 61521092, 61602443), Natural Science Foundation of Beijing (L182053), and the CAS Interdisciplinary Innovation and Cooperation Team for Efficient Space Weather Forecast Models.

Research on Locality-aware Design Mechanism of State-of-the-art Parallel Programming Languages

YUAN Liang1, ZHANG Yun-quan1, BAI Xue-rui2, ZHANG Guang-ting1

  1. (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)1;
    (Bureau of Frontier Sciences and Education, Chinese Academy of Sciences, Beijing 100864, China)2
  • Received: 2018-12-15; Published: 2020-01-20
  • About the authors: YUAN Liang, born in 1984, Ph.D, associate professor, is a member of China Computer Federation (CCF). His main research interests include parallel computational models. ZHANG Yun-quan, born in 1973, Ph.D, professor, Ph.D supervisor, is a senior member of China Computer Federation (CCF). His main research interests include high performance computing and parallel processing of big data.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2017YFB0202001), the Strategic Priority Research Program of Chinese Academy of Sciences, Category C (XDC01040100), the National Natural Science Foundation of China (61432018, 61521092, 61602443), the Natural Science Foundation of Beijing (L182053) and the CAS Interdisciplinary Innovation Team of Efficient Space Weather Forecast Models.

Abstract: One of the key bottlenecks in parallelizing large-scale parallel applications and optimizing their performance is the ever deeper and more complex memory hierarchy of multi-core CPUs. This paper systematically analyzes and summarizes the locality design methods found in today's mainstream multi-core CPUs and parallel programming languages, and proposes two kinds of locality, namely horizontal locality and vertical locality. From the perspective of these two kinds of locality, it analyzes in depth the locality design mechanisms of the major current parallel programming languages, summarizes and compares their strengths and weaknesses, and points out the characteristics that a new generation of parallel programming languages should have, arguing in particular that a new language should integrate design mechanisms that support both kinds of locality.

Keywords: Parallel programming model, Parallel programming language, Parallelism, Multi-core, Locality

Abstract: The memory access locality of a parallel program is becoming an increasingly important factor for extracting performance from the ever deeper and more complex memory hierarchies of current multi-core processors. In this paper, two different kinds of locality, horizontal locality and vertical locality, were proposed and defined. The state-of-the-art parallel programming languages were investigated and analyzed, and the methods and mechanisms by which these languages describe and control memory access locality were examined in detail from the viewpoints of horizontal and vertical locality. Finally, some future research directions for parallel programming languages were summarized, with emphasis on the importance of integrating and supporting both horizontal locality and vertical locality in future parallel programming language research.

Key words: Locality, Multi-core, Parallel programming language, Parallel programming model, Parallelism
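
To make the two notions of locality defined in the abstract concrete, the sketch below shows, in C with OpenMP, how a programmer typically has to express them by hand today. It is an illustrative example only, not code from the paper; the matrix size, tile size, and function names are assumptions. Horizontal locality appears as a static partitioning of rows among threads (combined with first-touch page placement), and vertical locality appears as loop tiling so that each block is reused from cache.

/*
 * Illustrative sketch only -- not code from the paper. Matrix size, tile
 * size and function names are assumed for illustration.
 * Compile with:  cc -O2 -fopenmp locality_sketch.c
 */
#include <stdlib.h>

#define N    1024        /* matrix dimension (assumed) */
#define TILE 64          /* tile edge assumed to keep the working set in cache */

/* Horizontal locality: rows are partitioned statically among threads, so each
 * thread touches only its own contiguous block of rows; under a first-touch
 * allocation policy those pages also land in the touching thread's local
 * NUMA memory. */
static void init_rows(double *a, double val)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[(size_t)i * N + j] = val;
}

/* Vertical locality: the triple loop is tiled so that each TILE x TILE block
 * is reused from cache many times before being evicted, exploiting the
 * vertical register/cache/memory hierarchy. */
static void tiled_matmul(const double *a, const double *b, double *c)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++)
                        for (int j = jj; j < jj + TILE; j++)
                            c[(size_t)i * N + j] +=
                                a[(size_t)i * N + k] * b[(size_t)k * N + j];
}

int main(void)
{
    double *a = malloc(sizeof(double) * N * N);
    double *b = malloc(sizeof(double) * N * N);
    double *c = malloc(sizeof(double) * N * N);
    init_rows(a, 1.0);   /* first touch places each row block near its thread */
    init_rows(b, 2.0);
    init_rows(c, 0.0);
    tiled_matmul(a, b, c);
    free(a); free(b); free(c);
    return 0;
}

Note that in this sketch both forms of locality are encoded only implicitly, through scheduling clauses and loop structure rather than through first-class language constructs; how parallel programming languages can describe and control such locality explicitly is exactly the question the paper surveys.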

CLC Number: TP312