Computer Science ›› 2015, Vol. 42 ›› Issue (9): 177-182. doi: 10.11896/j.issn.1002-137X.2015.09.034

• Software and Database Technology •

Improvement of Recovery Mechanism for Lustre Metadata Service

QIAN Ying-jin, LI Yong-gang, WANG Yi and ZHOU Lin-qi

  1. Technology Department, China Satellite Maritime Tracking and Control Department, Jiangyin 214431
  • Online: 2018-11-14  Published: 2018-11-14
  • Supported by:
    This work was supported by the National 973 Program of China

Abstract: Lustre's reboot recovery algorithm requires all clients in the cluster to reconnect to the server within a specified recovery time window. The clients then resend their uncommitted transactional requests, and the server replays all of them strictly in transaction number order. These recovery conditions are too strict. To address Lustre's weak recoverability, this paper proposes two algorithms, version-based recovery and commit on share, which improve and extend Lustre's existing metadata update and recovery mechanisms, respectively. Based on the dependencies between transactions, they allow clients to recover and rejoin the cluster under more relaxed conditions without being evicted, which improves the availability and recoverability of the Lustre file system. Finally, the performance of the improved algorithms is evaluated through a series of experiments.

Key words: Lustre, HPC, Recoverability, Availability
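
The contrast the abstract draws between strict replay in transaction number order and dependency-aware replay can be illustrated with a small sketch. This is not the paper's implementation. It assumes, for illustration only, that each request records the version of the object it saw before modifying it, and that an object's version is the transaction number of its last change. The names `Request`, `strict_replay` and `version_based_replay` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    transno: int      # transaction number assigned before the server failed
    client: str       # hypothetical client identifier
    obj: str          # hypothetical object (for example an inode) being modified
    pre_version: int  # version of obj the client observed before the operation


def strict_replay(requests, first_transno):
    """Replay strictly in transno order and stop at the first gap."""
    replayed = []
    expected = first_transno
    for req in sorted(requests, key=lambda r: r.transno):
        if req.transno != expected:
            break  # a transaction is missing, so nothing later can be replayed
        replayed.append(req.transno)
        expected += 1
    return replayed


def version_based_replay(requests, committed_versions):
    """Replay in transno order but tolerate gaps.

    A request is replayed only if the object's current version equals the
    version the request recorded; otherwise it depends on a lost transaction.
    """
    versions = dict(committed_versions)
    replayed, rejected = [], []
    for req in sorted(requests, key=lambda r: r.transno):
        if versions.get(req.obj, 0) == req.pre_version:
            versions[req.obj] = req.transno  # assumed: new version = transno
            replayed.append(req.transno)
        else:
            rejected.append(req.transno)
    return replayed, rejected


# Client B (transno 12, object "b") never reconnects after the restart.
received = [
    Request(11, "A", "a", 10),
    Request(13, "A", "a", 11),
    Request(14, "C", "b", 12),  # depends on B's lost change to "b"
]
print(strict_replay(received, 11))                         # [11]
print(version_based_replay(received, {"a": 10, "b": 10}))  # ([11, 13], [14])
```

In this example strict ordering stops at the gap left by the missing client, while the version check still replays client A's independent transactions and rejects only the request that depended on the lost one.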

