Computer Science ›› 2015, Vol. 42 ›› Issue (9): 177-182. doi: 10.11896/j.issn.1002-137X.2015.09.034


Improvement of Recovery Mechanism for Lustre Metadata Service

QIAN Ying-jin, LI Yong-gang, WANG Yi and ZHOU Lin-qi   

  • Online:2018-11-14 Published:2018-11-14

Abstract: Lustre's reboot recovery algorithm requires all clients to reconnect to the server within a dedicated recovery time window. The clients then resend their uncommitted transactional requests, and the server replays these requests strictly in transaction-number order. These recovery conditions are too strict. To improve Lustre's recoverability and availability, this paper proposes two algorithms, version-based recovery and commit on share. They extend Lustre's metadata update algorithm and recovery algorithm, respectively, and allow clients to rejoin the cluster through recovery under more relaxed conditions, according to the dependencies between transactions. Finally, the performance of the improved recovery algorithms is evaluated through a series of experiments.
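
The contrast the abstract draws between strict transaction-order replay and dependency-aware replay can be illustrated with a small sketch. This is not the paper's implementation. The `Request` fields, the function names, and the rule that an object's version becomes the transaction number of its last update are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical uncommitted update resent by a client after a server reboot."""
    transno: int       # transaction number assigned before the crash
    obj: str           # illustrative identifier of the object being updated
    pre_version: int   # version of the object the client saw before the update


def strict_replay(requests, last_committed):
    """Replay in transaction-number order and stop at the first gap."""
    replayed = []
    expected = last_committed + 1
    for req in sorted(requests, key=lambda r: r.transno):
        if req.transno != expected:
            break  # a client that never reconnected leaves a gap here
        replayed.append(req.transno)
        expected += 1
    return replayed


def version_based_replay(requests, versions):
    """Replay every request whose object is still at the version the client saw."""
    versions = dict(versions)  # work on a copy of the on-disk versions
    replayed = []
    for req in sorted(requests, key=lambda r: r.transno):
        if versions.get(req.obj, 0) == req.pre_version:
            versions[req.obj] = req.transno  # assumed rule: new version = transno
            replayed.append(req.transno)
    return replayed


# Transaction 12 belonged to a client that did not reconnect, so it is missing.
resent = [
    Request(transno=11, obj="a", pre_version=5),
    Request(transno=13, obj="a", pre_version=11),
    Request(transno=14, obj="b", pre_version=12),  # depends on the missing 12
]
print(strict_replay(resent, last_committed=10))
print(version_based_replay(resent, {"a": 5, "b": 7}))
```

In this sketch, strict replay stops at the gap left by transaction 12, while the version check still admits transaction 13 because it depends only on transaction 11. Transaction 14 is rejected in both cases because it depends on the missing update.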

Key words: Lustre, HPC, Recoverability, Availability

