Computer Science, 2022, Vol. 49, Issue (10): 1-9. doi: 10.11896/jsjkx.220100134
程稳1, 李焱2, 曾令仿3, 王芳1, 唐士程2, 杨力平2, 冯丹1, 曾文君2
CHENG Wen1, LI Yan2, ZENG Ling-fang3, WANG Fang1, TANG Shi-cheng2, YANG Li-ping2, FENG Dan1, ZENG Wen-jun2
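Abstract: Error logs from cluster storage systems help improve the availability and stability of those systems. Existing studies of storage system errors mainly analyze and evaluate single-node storage systems or selected functions of cluster storage systems; long-term, multi-perspective studies conducted in a single production environment under real workloads are lacking. As new functional modules are continually integrated, cluster storage systems grow increasingly large and complex, and errors caused by the systems themselves keep emerging, which poses difficulties and challenges for developers. To address these problems, this paper proposes an error log analysis and system optimization strategy for Lustre cluster storage. Using error logs collected over 1,673 consecutive days, nearly 2.26 GB of Lustre error logs in total, it analyzes the characteristics and problems of errors in multiple Lustre versions, reveals deficiencies and errors in various aspects of the cluster storage system, studies the factors that influence errors across Lustre versions, summarizes the common errors of Lustre clusters in a real production environment, and gives corresponding solutions. The study yields a number of new insights into Lustre development, summarized as 14 findings. These findings are then validated against 333 days of newly collected error records, and several examples of optimizations for system errors are given. Tests show that these optimizations significantly reduce the number of errors and improve system availability and stability. The results and recommendations offer a useful reference for the development of cluster storage systems as well as for their operation and maintenance.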