基于Intel MIC架构的3D有限差分算法优化

doi:10.11896/j.issn.1002-137X.2017.05.005

摘要/Abstract

摘要： 有限差分算法是一种基于偏微分方程的数值离散方法,被广泛应用于弹性波传播问题的数值模拟中。该算法访存跨度大、计算密度高、CPU利用率低,这在实际应用中成为了性能瓶颈。针对上述问题,在详析3D有限差分算法(3DFD)的基础上,基于Intel MIC架构,采用三步递进法对其进行优化:首先,通过分支消除、循环展开、不变量外提等基本优化法削减计算强度并为向量化扫除障碍；然后,通过分析数据依赖及循环分块,使用向量指令集改写核心算法等并行优化法,充分利用MIC协处理器多线程、长向量的机制；最后,在异构众核平台(CPU+MIC:Many Integra-ted Cores)下通过数据传输最小化、负载均衡等异构协同优化法实现CPU和MIC的并行计算。实验验证,与原有算法相比,优化后的算法在异构平台上获得了50～120倍的加速。

关键词: 有限差分算法,MIC架构,向量化,异构协同,并行计算

Abstract: Finite difference algorithm is a numerical discrete method based on the partial differential equation which is widely applied in elastic wave propagation simulation.Because of the high computation density,long distance memory access pattern and low CPU utilization,it becomes the performance bottleneck in practical applications.Aiming at solving above problems,this paper deliberated the key points of 3D finite difference(3DFD) algorithm and then proposed the three-step progressive method to optimize 3DFD algorithm based on Intel MIC.Firstly,the basic optimization methods,such as branch elimination,loop unroll,and invariant extraction,were proposed to reduce calculation strength and remove the obstacle of SIMD(Single Instruction Multiple Data).Secondly,by leveraging the parallel optimization methods such as data dependence analysis,loop tiling,and intrinsic SIMD instructions,it took full advantage of the mechanism of MIC coprocessor with multithreads and long vector.At last,the heterogeneous cooperative optimization methods,such as data transformation minimization and load balancing,were applied to the platform of CPU+MIC(Many Integrated Cores) which parallelizes the algorithm execution in both CPU and MIC.Experimental results show that the optimized 3DFD algorithm gains 50~120 speedup compared with original algorithm.

Key words: Finite difference algorithm,MIC architecture,SIMD,Heterogeneous cooperation,Parallel computation

郝鑫,郭绍忠. 基于Intel MIC架构的3D有限差分算法优化[J]. 计算机科学, 2017, 44(5): 26-32. https://doi.org/10.11896/j.issn.1002-137X.2017.05.005

HAO Xin and GUO Shao-zhong. Optimization of 3D Finite Difference Algorithm on Intel MIC[J]. Computer Science, 2017, 44(5): 26-32. https://doi.org/10.11896/j.issn.1002-137X.2017.05.005

参考文献

[1] Top500.org.Top 500 supercomputer site [EB/OL].[2015-12-09].http://www.top500.org.
[2] 王恩东,张清,沈铂,等.MIC高性能计算编程指南[M].北京:中国水利水电出版社,2012:15-21.
[3] ZHUKOV V T,KRANSON M M,NOVIKOVA ND,et al.Multigrid effectiveness on modern computing architectures[J].Programming and Computer Software,2015,2(1):14-22.
[4] REINDERS J,JEFFERS J.High Performance Parallelism Pearls,Multicore and Many-core Programming[M].Waltham,MA,USA:Morgan Kaufmann,2014:377-396.
[5] ITU L M,SUCIU C,MOLDOVEANU F,et al.GPU Optimized Computation of Stencil Based Algorithms[C]∥2011 RoEduNet International Conference 10th Edition:Networking in Education and Reasearch.2011:1-6.
[6] WANG G B,YANG X J,ZHANG Y,et al.Program Optimization of Stencil Based Application on the GPU-accelerated System[C]∥2009 IEEE International Symposium on Parallel and Distributed Processing with Applications.2009:219-225.
[7] HERNANDEZ M,CEBRIAN J M,CECILIA J M,et al.Evaluating 3-D Stencil codes on Intel Xeon Phi:Limitations and Tradeoffs[C]∥XXVI Jornadas De Paralelismo.2015:441-444.
[8] KOMATITSCH D,ERLEBACHER G,GODDEKE C,et al.High-order finite element seismic wave propagation modeling with MPI on a large GPU cluster[J].Journal of Computational physics,2010,9(20):7692-7714.
[9] FANG J B,SIPS H,ZHANG L L,et al.Test- driving Intel Xeon Phi[C]∥5th ACM/SPEC International Conference on Perfor-mance Engineering.2014:137-148.
[10] LIANG X,WANG Z Y,LIU G H,et al.Numerical simulation of elastic wave based on the staggered grid finite difference method[C]∥2011 International Conference on Consumer Electronics,Communications and Networks.2011:3283-3286.
[11] WANG T,LI L G,ZHANGY,et al.Pseudo-spectral method for modeling elastic wave propagation in isotropic medium[C]∥12th International Conference on Signal Processing.2014:58-62.
[12] WOSKOBOYNIKOVA G M.Seismic wave propagation in fractured media[C]∥Third International Forum on Strategic Technologies.2008:307-309.
[13] LIU H W,LI B,LIU H,et al.The Algorithm of high order finitedifference pre-stack reverse time migration and GPU implementation [J].Chinese Journal of Geophysics,2010,3(7):1725-1733.(in Chinese) 刘红伟,李博,刘洪,等.地震叠前逆时偏移高阶有限差分算法及GPU实现[J].地球物理学报,2010,53(7):1725-1733.
[14] PERAZA J,TIWARI A,LAURENZANO M,et al.Understan-ding the performance of stencil computations on Intel's Xeon Phi[C]∥IEEE International Conference on Cluster Computing.2013:1-5.
[15] SHAN Y,WU J P,WANG Z H.Hierarchial parallel programming model and parallelization and optimization techniques based on SMP cluster[J].Appolication Research of Computers,2006,3(10):254-256.(in Chinese) 单莹,吴建平,王正华.基于SMP集群的多层次并行编程模型与并行优化技术[J].计算机应用研究,2006,3(10):254-256.
[16] SATO M.OpenMP:Parallel Programming API for shared memo-ry multiprocessors and on-chip multiprocessors[C]∥15th International Symposium on System Synthesis.2002:109-111.
[17] JEFFERS J,REINDERS J.Intel Xeon Phi coprocessor high-performance programming [M].陈建,周珊,李慧等,译.北京:人民邮电出版社,2014:92-96.
[18] STROUT M M,LUPORINI F,KRIEGER C D,et al.Generalizing Run-time Tiling with the Loop Chain Abstraction[C]∥IEEE 28th International Parallel and Distributed Processing Symposium.2014:1136-1145.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed