Computer Science (计算机科学), 2014, Vol. 41, Issue (6): 12-17. doi: 10.11896/j.issn.1002-137X.2014.06.003
XU Jin-chen (许瑾晨), GUO Shao-zhong (郭绍忠), HUANG Yong-zhong (黄永忠) and WANG Lei (王磊)
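Abstract: The algorithms used by math library functions cause those functions to perform a large number of memory accesses. In current heterogeneous many-core architectures, the slave cores access data through shared main memory, which severely limits their memory access speed. As a result, the performance of math library functions on heterogeneous many-core architectures cannot meet the requirements of high-performance computing. To address this problem, a scheduling strategy based on memory access instructions is proposed: memory access latency is hidden within computation latency, which improves the performance of functions in a math library implemented in assembly. In addition, by combining dynamic calling with the slave core's local data memory (LDM), an ldm_call algorithm is proposed to increase memory access speed. Both optimization techniques are generally applicable to shared-memory architectures, and both effectively reduce function memory access overhead and increase memory access speed. Experiments show that the two techniques improve function performance by 16.08% and 37.32% on average, respectively.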
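The idea of hiding memory access latency within computation latency can be illustrated with a toy model. The sketch below is not the paper's scheduler: it assumes a hypothetical single-issue, in-order core, a made-up load latency of 4 cycles and arithmetic latency of 1 cycle, and an invented five-instruction sequence. It only shows why moving independent arithmetic between a load and its first use shortens the total cycle count.

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, for readability only
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction
    latency: int            # cycles from issue until dest is available


LOAD_LAT = 4  # assumed (hypothetical) main-memory load latency
ALU_LAT = 1   # assumed (hypothetical) arithmetic latency


def total_cycles(program: Sequence[Instr]) -> int:
    """Cycle at which the last result is ready on a toy in-order,
    single-issue core that stalls until all source operands are ready."""
    ready = {}   # register -> cycle at which its value becomes available
    issue = -1   # issue cycle of the previous instruction
    finish = 0
    for ins in program:
        # Registers never written in the program are ready at cycle 0.
        operands_ready = max((ready.get(s, 0) for s in ins.srcs), default=0)
        issue = max(issue + 1, operands_ready)
        ready[ins.dest] = issue + ins.latency
        finish = max(finish, ready[ins.dest])
    return finish


# Invented sequence: one table load followed by arithmetic.
naive = [
    Instr("ld",  "t0", ("addr",),   LOAD_LAT),
    Instr("mul", "a",  ("t0", "x"), ALU_LAT),  # stalls waiting for the load
    Instr("mul", "b",  ("x", "x"),  ALU_LAT),  # independent of the load
    Instr("add", "c",  ("b", "x"),  ALU_LAT),  # independent of the load
    Instr("add", "r",  ("a", "c"),  ALU_LAT),
]

# Same instructions, with the independent arithmetic moved into the
# shadow of the load so the core does useful work while it waits.
scheduled = [naive[0], naive[2], naive[3], naive[1], naive[4]]

print(total_cycles(naive), total_cycles(scheduled))
```

Under these assumed latencies the original order takes 8 cycles and the reordered one takes 6, because two of the stall cycles after the load are filled with independent work. Both orders respect the same data dependences, so they compute the same result.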