Computer Science ›› 2011, Vol. 38 ›› Issue (5): 287-289.

• Architecture •

Cache-style Parallel Checkpointing for Large-scale Computing Systems

LIU Yong-yan, LIU Yong-peng, FENG Hua, CHI Wan-qing

  1. (Information Center, Ministry of Science and Technology, Beijing 100862) (College of Computer, National University of Defense Technology, Changsha 410073)
  • Online: 2018-11-16 Published: 2018-11-16
  • Supported by:
    This work was supported by the Open Fund of the State Key Laboratory of High-end Server & Storage Technology (2009HSSA04).

Abstract: Checkpointing is an important fault-tolerance mechanism in high-performance parallel computing systems, but as system scale grows, the scalability of parallel checkpointing is limited by the overhead of file access. Targeting the multi-level file system architecture of large-scale parallel computing systems, this paper proposes a cache-style parallel checkpointing technique. It translates globally coordinated parallel checkpointing into local file operations and exploits the multi-processor architecture to schedule write-back as an out-of-order pipeline, so that checkpoint write-back times are spread out sensibly. The write-back overhead is thereby hidden effectively, which preserves the performance and scalability of file access for parallel checkpointing.

Key words: Cache-style checkpointing, Parallel computing, Multi-level file system, Multi-processor, Out-of-order pipeline
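
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `CachedCheckpointer`, the file naming scheme, and the use of Python threads are all assumptions made for illustration, and two temporary directories stand in for the local and global levels of a multi-level file system. Each checkpoint call only writes a local file and returns, while background workers flush the files to the global level and may finish in any order.

```python
import os
import queue
import shutil
import threading


class CachedCheckpointer:
    """Hypothetical sketch: local checkpoint writes with background write-back."""

    def __init__(self, local_dir, global_dir, num_flushers=2):
        self.local_dir = local_dir      # stands in for fast node-local storage
        self.global_dir = global_dir    # stands in for the shared global file system
        self._pending = queue.Queue()   # names of checkpoints awaiting write-back
        self._workers = [
            threading.Thread(target=self._flush_loop, daemon=True)
            for _ in range(num_flushers)
        ]
        for worker in self._workers:
            worker.start()

    def checkpoint(self, rank, step, data):
        """Write the checkpoint locally and return without waiting for write-back."""
        name = f"ckpt_rank{rank}_step{step}.bin"
        local_path = os.path.join(self.local_dir, name)
        with open(local_path, "wb") as f:
            f.write(data)
        self._pending.put(name)
        return local_path

    def _flush_loop(self):
        # Several workers drain the queue concurrently, so write-backs
        # overlap with computation and may complete out of order.
        while True:
            name = self._pending.get()
            try:
                if name is None:        # shutdown sentinel
                    return
                src = os.path.join(self.local_dir, name)
                tmp = os.path.join(self.global_dir, name + ".tmp")
                shutil.copyfile(src, tmp)
                # Rename so a partially copied file is never visible under its final name.
                os.replace(tmp, os.path.join(self.global_dir, name))
            finally:
                self._pending.task_done()

    def close(self):
        """Wait for all pending write-backs, then stop the workers."""
        self._pending.join()
        for _ in self._workers:
            self._pending.put(None)
        for worker in self._workers:
            worker.join()
```

A caller would create one `CachedCheckpointer`, invoke `checkpoint(rank, step, data)` at each checkpoint interval, and call `close()` before exit. How the paper actually schedules write-back across processors is not specified in the abstract, so the simple worker pool here is only a stand-in for that scheduling.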
