Computer Science, 2022, Vol. 49, Issue (10): 1-9. doi: 10.11896/jsjkx.220100134

• High Performance Computing

Error Log Analysis and System Optimization for Lustre Cluster Storage

CHENG Wen1, LI Yan2, ZENG Ling-fang3, WANG Fang1, TANG Shi-cheng2, YANG Li-ping2, FENG Dan1, ZENG Wen-jun2

  1 Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, Engineering Research Center of Data Storage Systems and Technology, Huazhong University of Science and Technology, Ministry of Education of China, Wuhan 430074, China
  2 China National GeneBank, BGI-Shenzhen, Shenzhen, Guangdong 518120, China
  3 Zhejiang Lab, Hangzhou 311121, China
  • Received: 2022-01-14 Revised: 2022-05-06 Online: 2022-10-15 Published: 2022-10-13
  • Corresponding author: WANG Fang (wangfang@hust.edu.cn)
  • About author: CHENG Wen (chengwen@hust.edu.cn), born in 1989, Ph.D. candidate, is a student member of the China Computer Federation. His main research interests include distributed storage systems, data mining, system optimization, and data privacy protection.
    WANG Fang, born in 1972, Ph.D., professor, Ph.D. supervisor, is a senior member of the China Computer Federation. Her main research interests include distributed storage systems, parallel file systems, new storage systems based on non-volatile memory devices, software-defined storage, and large-scale graph data storage and processing.
  • Supported by:
    Key Program of the National Natural Science Foundation of China (NSFC) (61832020), Creative Research Group Project of NSFC (61821003), and Center-initiated Research Project of Zhejiang Lab (2021DA0AM01).

Abstract: Error logs from cluster storage systems can help improve the availability and reliability of those systems. Previous studies of storage system errors focus on local file systems or on a subset of a cluster storage system's functions. Long-term, multi-perspective studies of error logs from a single production environment under real workloads are lacking. As new functional modules are continually integrated, cluster storage systems grow more complex, and errors caused by the storage system itself keep emerging, which creates difficulties for researchers and developers. To address these problems, we conduct a comprehensive study of Lustre system error logs. We collect error logs over 1,673 consecutive days, study nearly 2.26 GB of Lustre error logs, and analyze the characteristics and problems of errors across multiple Lustre versions. We show the correlated errors between different subsystems and study the factors that may influence errors in different Lustre versions. We also summarize the common errors of Lustre clusters in a production environment and give the corresponding solutions. We derive numerous new insights into Lustre system development and report 14 findings. Finally, we collect new error logs for 333 consecutive days to verify the 14 findings and present several error optimization cases. Experimental results show that these optimization cases can significantly reduce the number of errors and improve the availability and stability of the system. Our results and suggestions should be useful both for the development of cluster storage systems themselves and for their operation and maintenance.

Key words: Lustre file system, Log analysis, System optimization, Bug, Reliability

CLC Number:

  • TP399