计算机科学 ›› 2025, Vol. 52 ›› Issue (2): 48-57.doi: 10.11896/jsjkx.241000022
孙婧1, 牛虹婷2, 梁松涛3
SUN Jing1, NIU Hongting2, LIANG Songtao3
摘要: 纠删码算法在单数据中心和多数据中心得到了广泛的应用。目前对纠删码算法的研究更多地关注存储成本和修复带宽,对于如何在专线带宽、交换机受限的情况下完成多数据中心之间的修复,如何在可靠性、容错能力等核心因素之间实现最佳权衡等问题,没有进行充分的分析和解决。针对三数据中心这种最常用的多数据中心场景,首先,提出了纠删码在系统设计中重要的4个因素:冗余度、可靠性、容错能力及解码带宽。其次,根据提出的4个因素,设计了一种单数据中心下满足最优带宽修复的S-LRC算法。再根据提出的S-LRC算法,设计了满足三中心架构体系下的G-LRC算法。相比传统的编码方案,提出的G-LRC算法具有更高的可靠性、更大的容错性及解码带宽惩罚比。其两节点故障时解码带宽惩罚比仅为传统方案的1/7~2/7。最后,将G-LRC算法在大文件存储系统中进行了实现和验证,并且设计了解码最优决策算法来减少修复的带宽,解决了非最大距离可分割码算法在系统中落地难的问题。
中图分类号:
[1]FORD D,LABELLE F,POPOVICI F I,et al.Availability inGlobally Distributed Storage Systems[C]//The 9th USENIX Symposium on Operating Systems Design and Implementation.2010:61-74. [2]SVEND F,ARIF M,YASUSHI S,et al.A decentralized algo-rithm for erasure-coded virtual disks[C]//In International Conference on Dependable Systems and Networks.2004:125-134. [3]SCHROEDER B,GIBSON G A.Disk failures in the real world:What does an MTTF of 1 000 000 hours mean to you? [C]//Proceedings of the 5th USENIX Conference on File and Storage Technologies.2007:1-16. [4]SUN J,LIANG S T,LU X J.An approximately optimal disk repair algorithm for distributed storage systems[J].Science China Information Sciences,2020,50:1834-1849. [5]HU Y,LIU Y,LI W,et al.Unequal failure protection coding technique for distributed cloud storage Systems[J].IEEE Transactions on Cloud Computing,2021,9(1):386-400. [6]ZHANG C,XU Y, HU Y P,et al.A Blockchain-based multi-cloud storage data auditing scheme to locate faults[J].IEEE Transactions on Cloud Computing,2022,10(4):2252-2263. [7]HUANG C,XU L.STAR:An efficient coding scheme for correcting triple storage node failures[J].IEEE Transactions on Computers,2008,57(7):889-901. [8]KV RASHMI,NIHAR B S,GU D K,et al.A hitchhiker's guide to fast and efcient data reconstruction in erasure-coded data centers[J].ACM SIGCOMM Computer Communication Review,2014,44:331-342. [9]IRVING S R,GUSTAVE S.Polynomial codes over certain finite fields[J].Journal of the Society for Industrial and Applied Mathematics,1960,8(2):300-304. [10]HUANG C,SIMITCI H,XU Y,et al.Erasure coding in windows azure storage[C]//Proceedings of the 2012 USENIX Conference on Annual Technical Conference.Berkeley:USENIX Association,2012:15-26. [11]GOPALAN P,HUANG C,JENKINS B,et al.Explicit maximally recoverable codes with locality[J].IEEE Transactions on Information Theory,2014,60(9),5245-5256. [12]LIU K,PENG J,HUANG Z,et al.Adaptive and scalable caching with erasure codes in distributed cloud-edge storage systems[J].IEEE Transactions on Cloud Computing,2023,11(2):1840-1853. [13]HOU H,LEE P P C.Rack-aware regenerating codes for data centers[J].IEEE Transactions on Information Theory,2019,65(8):4730-4745. [14]SUN M,TANG D,LI Y,et al.An Erasure Code with Low Repair-Cost Based on A Combined-stripe Encoding Structure[J].International Journal of Network Security,2024,26(2):206-216. [15]ZHOU H,FENG D.Boosting erasure-coded multi-stripe repairin rack architecture and heterogeneous clusters:design and ana-lysis[J].IEEE Transactions on Parallel and Distributed Systems,2023,34(8):2251-2264. [16]PATIL A,RANGARAO D,SEIPP H,et al.Cloud object storage as a service:IBM cloud object storage from theory to practice-for developers,IT architects and IT specialists[EB/OL].https://www.redbooks.ibm.com. [17]DELL TECHNICAL WHITE PAPER.Predicts 2021:Dell EMC ECS:High Availability Design[EB/OL].https://www.delltechnologies.com/asset/en-ie/products/storage/industry-market/h16344-ecs-high-availability-design.pdf. [18]年终盘点:2023年最重大的15次云故障[EB/OL].https://weibo.com/ttarticle/p/show?id=2309404982883017425028. [19]GREENAN K M,PLANK J S,W J J.Mean time to Meaningless:MTTDL,Markov models,and storage system reliability[C]//HotStorage.2010:1-5. [20]YANG C C.Large scale distributed storage system:principleanalysis and architecture practice[M].Mechanical Industry Press.2013:22-23. |
|