Extension to OpenMP Task Scheduling Mechanism for DSWP Parallelization and its Implementation

Abstract

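Abstract: Multicore processors can improve the performance of multithreaded programs, but the many existing single-threaded programs cannot benefit from them, and programmers are still accustomed to writing single-threaded code. Automatic parallelization is an important means of porting single-threaded programs to multicore platforms, but traditional automatic parallelization techniques perform poorly when a loop contains data dependences that cannot be determined or complex control flow. For loops on which traditional automatic parallelization fails, Ottoni et al. proposed the Decoupled Software Pipelining (DSWP) algorithm to achieve fine-grained, instruction-level parallelism. DSWP, however, requires a deep understanding of the processor architecture as well as hardware support for inter-core communication queues and special instructions, which limits both its parallel performance and its applicability. DSWP parallelization implemented on top of the OpenMP application programming interface does not depend on hardware support for inter-core communication queues or special instructions and is not tied to a particular platform, but the existing OpenMP task scheduling mechanism cannot meet the task scheduling requirements of DSWP parallelization. This paper extends the existing OpenMP task scheduling mechanism with an attribute that binds a task to a thread, which guarantees the correct execution of OpenMP-based DSWP parallel programs. A clause for the task binding attribute is implemented in libgomp, the OpenMP runtime library of GCC, and the extended GCC serves as the base compiler for OpenMP DSWP programs, providing support for automatic parallelization. Tests on the NPB3.3.1 benchmark suite show that loops on which traditional automatic parallelization fails achieve an average speedup of more than 1.23 on a dual-core processor after OpenMP DSWP automatic parallelization. Compared with programs produced by the Intel compiler and the Open64 compiler using only traditional automatic parallelization, the parallel programs generated by the Open64 compiler with the OpenMP DSWP algorithm added achieve average speedups that are 22% and 26% higher, respectively.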

Keywords: Automatic parallelization, OpenMP, DSWP, Task scheduling mechanism, GCC. CLC number: TP314. Document code: A

Abstract: While multicore processors increase throughput for multi-programmed and multithreaded codes, many important applications are single-threaded and thus do not benefit. Automatic parallelization techniques play an important role in migrating single-threaded applications to multicore platforms. Unfortunately, the prevalence of control flow, recursive data structures, and general pointer accesses in ordinary programs renders the existing techniques unsuitable. Ottoni et al. proposed an automatic parallelization algorithm called Decoupled Software Pipelining (DSWP) to exploit fine-grained pipeline parallelism at the instruction level, but it requires knowledge of micro-architectural properties and hardware support for a communication channel and two special instructions. The improved DSWP algorithm based on OpenMP increases the parallel granularity and no longer relies on hardware support, but the existing OpenMP task scheduling mechanism cannot satisfy the needs of DSWP. A new binding clause for the task construct in OpenMP is proposed to extend the task scheduling mechanism, and it guarantees the correctness of OpenMP DSWP parallelization. The new clause is implemented in the GCC runtime library libgomp, which provides support for the compilation of OpenMP DSWP programs. Experimental results show that loops that existing techniques fail to parallelize can be parallelized by the improved automatic parallelization algorithm and gain a significant performance improvement on a dual-core CPU, with an average speedup above 1.23. Compared with programs generated by the Intel and Open64 compilers, the OpenMP DSWP programs generated by the compiler with the improved algorithm achieve average speedups that are 22% and 26% higher, respectively.

Key words: Automatic parallelization, OpenMP, Decoupled software pipelining, Task scheduling mechanism, GCC
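
The decoupled pipeline that the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation and does not use their OpenMP clause: Python threads stand in for OpenMP tasks that are each bound to a fixed thread, a bounded software queue stands in for the inter-core communication queue, and all names (`traverse_stage`, `compute_stage`, `run_pipeline`, `SENTINEL`) are hypothetical. One plausible reading of why binding matters, stated here as an assumption, is that the producer blocks once the bounded queue is full, so the pipeline only makes progress if the consumer stage is running on a different thread.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker passed between stages


def traverse_stage(items, out_q):
    """Stage 1: the sequential part of the loop (for example, pointer chasing)."""
    for item in items:
        out_q.put(item)      # blocks while the bounded queue is full
    out_q.put(SENTINEL)


def compute_stage(in_q, results):
    """Stage 2: the per-iteration work, decoupled from stage 1."""
    while True:
        item = in_q.get()    # blocks while the queue is empty
        if item is SENTINEL:
            break
        results.append(item * item)


def run_pipeline(items, queue_size=4):
    """Run the two stages on two dedicated threads and return stage 2's output."""
    q = queue.Queue(maxsize=queue_size)
    results = []
    stages = [
        threading.Thread(target=traverse_stage, args=(items, q)),
        threading.Thread(target=compute_stage, args=(q, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Calling `run_pipeline(range(10))` pushes more items than the queue can hold, so the first stage has to wait for the second stage to drain the queue. If both stages were run one after the other on a single thread, the first stage would never get past the full queue.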

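LIU Xiao-xian (刘晓娴), ZHAO Rong-cai (赵荣彩), DING Rui (丁锐). Extension to OpenMP Task Scheduling Mechanism for DSWP Parallelization and its Implementation[J]. Computer Science (计算机科学), 2013, 40(9): 38-43. https://doi.org/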

LIU Xiao-xian, ZHAO Rong-cai and DING Rui. Extension to OpenMP Task Scheduling Mechanism for DSWP Parallelization and its Implementation[J]. Computer Science, 2013, 40(9): 38-43. https://doi.org/
