Computer Science ›› 2025, Vol. 52 ›› Issue (1): 120-130. doi: 10.11896/jsjkx.231200011

• Database & Big Data & Data Science •

Research Progress on Optimization Techniques of Tiered Storage Based on Deduplication

YAO Zilu1, FU Yinjin2, XIAO Nong1,2   

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
  2. National Supercomputer Center in Guangzhou, Sun Yat-Sen University, Guangzhou 510006, China
  • Received: 2023-12-01 Revised: 2024-04-27 Online: 2025-01-15 Published: 2025-01-09
  • About author: YAO Zilu, born in 2001, postgraduate, is a member of CCF (No. P8042G). His main research interests include tiered storage and deduplication.
    XIAO Nong, born in 1969, Ph.D., professor, doctoral supervisor. His main research interests include large-scale storage systems, cloud computing and network computing, and computer architecture.
  • Supported by:
    National Key Research and Development Program of China (2022YFB4500304) and National Natural Science Foundation of China (62332021, 61832020).

Abstract: With the explosive growth of global data volume and the increasing diversity of data, storage systems built on a single media layer are increasingly unable to meet users' diverse application demands. Tiered storage classifies data by importance, access frequency, security requirements, and other characteristics, and places it in storage layers that differ in access latency, storage capacity, and fault tolerance. It has been widely applied in many fields. Deduplication is a data reduction technique for big data that efficiently removes duplicate data from storage systems and maximizes storage space utilization. Unlike in single-layer scenarios, applying deduplication to tiered storage not only reduces cross-layer data redundancy, which further saves storage space and lowers storage costs, but also improves data I/O performance and storage device durability. After a brief analysis of the principle, process, and classification of deduplication-based tiered storage, this paper surveys research progress on optimization methods for three key steps: storage location selection, duplicate content identification, and data migration operation. It then explores the potential technical challenges of deduplication-based tiered storage. Finally, it discusses the future development trends of deduplication-based tiered storage.

Key words: Deduplication, Tiered storage, Storage location selection, Duplicate content identification, Data migration

CLC Number: 

  • TP311
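
As a rough illustration of the three steps named in the abstract (storage location selection, duplicate content identification, and data migration), the following minimal Python sketch models a two-tier store with a single fingerprint index shared by both tiers. It is not the method of any surveyed system: the class name `TieredDedupStore`, the `hot`/`cold` tier names, the `access_hint` parameter, the use of SHA-256 fingerprints, and the caller-supplied placement policy are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class TieredDedupStore:
    """Toy two-tier store; one fingerprint index spans both tiers (assumption)."""
    tiers: dict = field(default_factory=lambda: {"hot": {}, "cold": {}})
    index: dict = field(default_factory=dict)     # fingerprint -> tier name
    refcount: dict = field(default_factory=dict)  # fingerprint -> reference count

    @staticmethod
    def fingerprint(chunk: bytes) -> str:
        # Content fingerprint; SHA-256 is an illustrative choice.
        return hashlib.sha256(chunk).hexdigest()

    def write(self, chunk: bytes, access_hint: str = "cold") -> str:
        fp = self.fingerprint(chunk)
        if fp in self.index:
            # Duplicate content identification: the chunk already exists in
            # some tier, so only the reference count changes.
            self.refcount[fp] += 1
            return fp
        # Storage location selection: hypothetical policy driven by a hint.
        tier = "hot" if access_hint == "hot" else "cold"
        self.tiers[tier][fp] = chunk
        self.index[fp] = tier
        self.refcount[fp] = 1
        return fp

    def migrate(self, fp: str, target: str) -> None:
        # Data migration: move the single stored copy and update the index,
        # so no cross-tier duplicate is created.
        source = self.index[fp]
        if source == target:
            return
        self.tiers[target][fp] = self.tiers[source].pop(fp)
        self.index[fp] = target

    def read(self, fp: str) -> bytes:
        return self.tiers[self.index[fp]][fp]

    def stored_bytes(self) -> int:
        # Physical bytes held across all tiers.
        return sum(len(c) for t in self.tiers.values() for c in t.values())


store = TieredDedupStore()
first = store.write(b"block-A", access_hint="hot")
second = store.write(b"block-A")  # duplicate: no new physical copy
store.migrate(first, "cold")      # the one copy moves to the cold tier
```

Because the index is global rather than per tier, the second write is recognized as a duplicate even though its hint points at a different tier. A real system would also need chunking, persistent metadata, and garbage collection, none of which this sketch attempts.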