Computer Science (计算机科学), 2025, Vol. 52, Issue (1): 120-130. doi: 10.11896/jsjkx.231200011
姚子路1, 付印金2, 肖侬1,2
YAO Zilu1, FU Yinjin2, XIAO Nong1,2
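Abstract: With the explosive growth of global data volume and the increasing diversity of data, storage systems built on a single media tier can no longer meet users' varied application requirements. Tiered storage classifies data by characteristics such as importance, access frequency, and security requirements, and places it in storage tiers that differ in access latency, capacity, and fault tolerance; it has been widely adopted across many fields. Data deduplication is a data reduction technique for big data that efficiently removes duplicate data from storage systems and maximizes storage space utilization. Unlike the single-tier scenario, applying deduplication to tiered storage not only reduces cross-tier data redundancy, further saving storage space and lowering storage cost, but also improves data I/O performance and the endurance of storage devices. After briefly analyzing the principles, workflow, and classification of deduplication-based tiered storage, this paper surveys the research progress on optimization methods for its three key steps: storage location selection, duplicate content identification, and data migration. It then discusses the technical challenges that deduplication-based tiered storage may face. Finally, it outlines future development trends for deduplication-based tiered storage.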
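To make the idea of removing cross-tier redundancy concrete, the following is a minimal sketch rather than any method from the surveyed work. It assumes fixed-size chunking, SHA-256 fingerprints, two tiers named `fast` and `slow`, and a placement rule that uses a chunk's reference count as a stand-in for access frequency. The class `TieredDedupStore`, the constants `CHUNK_SIZE` and `HOT_THRESHOLD`, and all method names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4     # hypothetical: tiny fixed-size chunks so the example is easy to trace
HOT_THRESHOLD = 2  # hypothetical: chunks referenced at least this often go to the fast tier


class TieredDedupStore:
    """Toy two-tier store with a single fingerprint index shared by both tiers."""

    def __init__(self):
        self.tiers = {"fast": {}, "slow": {}}  # tier name -> {fingerprint: chunk bytes}
        self.location = {}  # fingerprint -> tier currently holding the chunk
        self.refcount = {}  # fingerprint -> number of logical references
        self.files = {}     # file name -> ordered list of fingerprints (file recipe)

    def write(self, name, data):
        """Duplicate content identification: store each unique chunk only once."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.location:
                # New content: keep one physical copy, initially in the slow tier.
                self.tiers["slow"][fp] = chunk
                self.location[fp] = "slow"
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            recipe.append(fp)
        self.files[name] = recipe

    def rebalance(self):
        """Location selection and migration: move each chunk to its target tier."""
        for fp, tier in list(self.location.items()):
            target = "fast" if self.refcount[fp] >= HOT_THRESHOLD else "slow"
            if target != tier:
                # A chunk is moved, never copied, so no tier holds a second copy.
                self.tiers[target][fp] = self.tiers[tier].pop(fp)
                self.location[fp] = target

    def read(self, name):
        """Rebuild a file from its recipe, fetching each chunk from its tier."""
        return b"".join(self.tiers[self.location[fp]][fp] for fp in self.files[name])

    def physical_bytes(self):
        """Total bytes physically stored across all tiers."""
        return sum(len(c) for tier in self.tiers.values() for c in tier.values())


store = TieredDedupStore()
store.write("a", b"AAAABBBB")
store.write("b", b"AAAACCCC")  # the chunk b"AAAA" is shared with file "a"
store.rebalance()
print(store.physical_bytes(), store.read("a"))
```

In this sketch the shared chunk is stored once and promoted to the fast tier, while the two files together occupy 12 physical bytes instead of 16. A real system would replace the reference-count rule with measured access statistics and would need persistent, crash-consistent metadata.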