Computer Science ›› 2022, Vol. 49 ›› Issue (10): 1-9.doi: 10.11896/jsjkx.220100134

• High Performance Computing •

Error Log Analysis and System Optimization for Lustre Cluster Storage

CHENG Wen1, LI Yan2, ZENG Ling-fang3, WANG Fang1, TANG Shi-cheng2, YANG Li-ping2, FENG Dan1, ZENG Wen-jun2   

  1 Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, Engineering Research Center of Data Storage Systems and Technology, Huazhong University of Science and Technology, Ministry of Education of China, Wuhan 430074, China
  2 China National GeneBank, BGI-Shenzhen, Shenzhen, Guangdong 518120, China
  3 Zhejiang Lab, Hangzhou 311121, China
  • Received:2022-01-14 Revised:2022-05-06 Online:2022-10-15 Published:2022-10-13
  • About author: CHENG Wen, born in 1989, Ph.D. candidate, is a student member of China Computer Federation. His main research interests include distributed storage systems, data mining, system optimization, and data privacy protection.
    WANG Fang, born in 1972, Ph.D., professor, Ph.D. supervisor, is a senior member of China Computer Federation. Her main research interests include distributed storage systems, parallel file systems, new storage systems based on non-volatile memory devices, software-defined storage, and large-scale graph data storage and processing.
  • Supported by:
    National Natural Science Foundation of China (NSFC) (61832020), Creative Research Group Project of NSFC (61821003) and Center-initiated Research Project of Zhejiang Lab (2021DA0AM01).

Abstract: Error messages from a cluster storage system can help improve its availability and reliability. Previous studies of storage system errors focus on local file systems or on a single part of a cluster storage system; long-term, multi-dimensional studies of error messages from systems in production use are lacking. As new functional modules are continuously integrated, cluster storage systems grow more complex and the errors they produce multiply, which poses challenges for researchers and developers. To address these problems, we conduct a comprehensive study of Lustre error logs. From logs collected over 1,673 consecutive days, we study nearly 2.26 GB of Lustre error logs and analyze the characteristics and problems of Lustre errors across multiple Lustre versions. We show correlations between errors in different subsystems and study the factors that may affect different Lustre versions. We also summarize the common errors in Lustre and present the corresponding solutions. We derive numerous new insights into the Lustre development process and report 14 findings. Finally, we collect new error logs over 333 consecutive days to verify the 14 findings and present several error optimization cases. Experimental results show that these cases significantly reduce the number of errors and improve the availability and stability of the system. Our results and suggestions should be useful both for the development of cluster storage systems and for Lustre operation and maintenance.

Key words: Lustre file system, Log analysis, System optimization, Bug, Reliability

CLC Number: 

  • TP399