Computer Science ›› 2014, Vol. 41 ›› Issue (6): 12-17. doi: 10.11896/j.issn.1002-137X.2014.06.003

• Survey •

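Memory Access Optimization Methods for a Mathematical Function Library on the Slave Cores of Heterogeneous Many-core Processors

XU Jin-chen, GUO Shao-zhong, HUANG Yong-zhong, WANG Lei

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, PLA Information Engineering University, Zhengzhou 450002 (all four authors)
  • Publication date: 2018-11-14  Release date: 2018-11-14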

Access Optimization Technique for Mathematical Library of Slave Processors on Heterogeneous Many-core Architectures

XU Jin-chen, GUO Shao-zhong, HUANG Yong-zhong and WANG Lei

  • Online:2018-11-14 Published:2018-11-14

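Abstract (translated from the Chinese): The algorithms behind mathematical library functions involve a large number of memory accesses. The slave cores of current heterogeneous many-core processors access data through shared main memory, which severely limits their memory access speed, so the performance of mathematical library functions on heterogeneous many-core architectures cannot meet the requirements of high-performance computing. To address this problem, a scheduling strategy based on memory access instructions is proposed, which hides memory access latency within computation latency and thereby improves the performance of functions in a mathematical function library implemented in assembly. In addition, by combining a dynamic calling mode with the local data memory (LDM) of the slave cores, an algorithm named ldm_call is proposed to increase memory access speed. Both optimization techniques are generally applicable to shared-memory architectures; they effectively reduce the memory access overhead of functions and increase memory access speed. Experiments show that the two techniques improve function performance by 16.08% and 37.32% on average, respectively.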

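Keywords: heterogeneous many-core, mathematical function library, memory access optimization, instruction scheduling, local data memory
CLC number: TP311  Document code: A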

Abstract: Owing to the nature of their algorithms, mathematical functions perform a large number of memory accesses. In heterogeneous many-core architectures, which have recently become widespread, the slave processors access data through shared main memory, which severely limits the access rate. As a result, the performance of the mathematical library's functions cannot meet the requirements of high-performance computing. To solve this problem, this study proposes a scheduling strategy based on memory access instructions that hides access latency behind the necessary computation. Using a dynamic calling mode, an algorithm called ldm_call is also introduced; it exploits the LDM (local data memory) of the slave processors to speed up memory access significantly. Both optimization techniques are generally applicable to shared-memory architectures, and they effectively reduce memory access overhead while increasing the access rate. Experimental results show that they improve function performance by 16.08% and 37.32% on average, respectively.

Key words: Heterogeneous many-core, Mathematical library, Access optimization, Instruction scheduling, Local data memory
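
The two ideas summarized in the abstract can be illustrated with a minimal C sketch. It is not the authors' implementation, which is written in assembly for the slave cores. The coefficient table, the `ldm_buf` array standing in for LDM, and the names `poly_eval_scheduled` and `ldm_call_sketch` are all assumptions made for illustration. The first function issues each table load one step ahead of the arithmetic that consumes it, so the load can overlap with computation; the second stages the table into a local buffer before calling the function through a pointer.

```c
#include <stddef.h>
#include <string.h>

#define NCOEF 6

/* Hypothetical coefficient table in shared main memory:
   Taylor coefficients of exp(x) around 0, i.e. 1/k! for k = 0..5. */
static const double main_mem_coef[NCOEF] = {
    1.0, 1.0, 0.5, 1.0 / 6.0, 1.0 / 24.0, 1.0 / 120.0
};

/* Horner evaluation written so that the load of the next coefficient
   is issued before the multiply-add that uses the current one.
   This mimics, at source level, scheduling loads ahead of computation. */
static double poly_eval_scheduled(const double *coef, double x)
{
    double acc  = coef[NCOEF - 1];
    double next = coef[NCOEF - 2];      /* first load issued early */
    for (int k = NCOEF - 2; k > 0; --k) {
        double cur = next;
        next = coef[k - 1];             /* load for the following step */
        acc  = acc * x + cur;           /* computation that can overlap it */
    }
    return acc * x + next;
}

/* Stand-in for the slave core's local data memory (assumption: on real
   hardware a platform-specific mechanism would place this buffer in LDM). */
static double ldm_buf[NCOEF];

typedef double (*math_fn)(const double *table, double x);

/* Hypothetical wrapper in the spirit of ldm_call: copy the table into the
   local buffer, then call the math function dynamically on that copy. */
static double ldm_call_sketch(math_fn fn, const double *table,
                              size_t n, double x)
{
    if (n > NCOEF) {
        n = NCOEF;                      /* never overrun the local buffer */
    }
    memcpy(ldm_buf, table, n * sizeof *table);
    return fn(ldm_buf, x);
}
```

Calling `ldm_call_sketch(poly_eval_scheduled, main_mem_coef, NCOEF, x)` returns the same value as calling `poly_eval_scheduled(main_mem_coef, x)` directly; only the memory that the loads touch differs. In this sketch the copy is repeated on every call, whereas a practical scheme would presumably stage the table once and reuse it across calls.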

