Computer Science ›› 2014, Vol. 41 ›› Issue (1): 22-30.

• Review •

A Review of Research on Data Deduplication Techniques in Storage Systems

XIE Ping

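  1. School of Computer Science, Qinghai Normal University, Xining 810008; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National Basic Research Program of China (973 Program) (2011CB302303)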

Survey on Data Deduplication Techniques for Storage Systems

XIE Ping   

  • Online:2018-11-14 Published:2018-11-14

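Abstract: The ever-growing enterprise demand for data capacity poses a severe challenge to data centers. Studies have found that up to 60% of the data in storage systems is redundant, and how to reduce this redundant data is drawing attention from a growing number of researchers. Data deduplication uses CPU resources to compare data block fingerprints and can effectively reduce the space needed to store data, which has made it a research focus in both industry and academia. After analyzing and summarizing the literature on data deduplication from the past 10 years, this paper first examines the architecture of a volume-level deduplication system and explains the principles, implementation mechanisms, and evaluation criteria of deduplication systems. It then considers how data scale and behavior affect deduplication system performance, and analyzes and summarizes the various techniques for improving that performance. Finally, it compares deduplication systems across application scenarios and gives 4 directions that need focused research: deduplication schemes for primary storage environments, deduplication schemes for distributed cluster environments, fast fingerprint lookup optimization, and intelligent data detection.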

Keywords: data deduplication, deduplication ratio, architecture, metadata structure, I/O optimization

Abstract: With the ever-increasing volume of enterprise data, the demand for massive storage capacity has become a grand challenge for data centers. Research shows that about 60% of the data in storage systems is redundant, so the problem of high redundancy in storage systems is attracting growing attention from researchers. Data deduplication uses CPU resources to compare data block fingerprints and can thereby reduce stored data efficiently, which has made it a hot topic in both industry and academia. Based on an analysis and summary of the literature on data deduplication from the past ten years, this paper first presents the principles, implementation mechanisms, and evaluation methodologies of representative deduplication systems by analyzing the architecture of a volume-level deduplication system. Second, it focuses on existing deduplication optimization techniques, taking into account both the characteristics of the data and the scale of the deduplication system. Finally, after a comparative analysis of deduplication systems in various application scenarios, it gives four research directions: primary-storage-level deduplication approaches, distributed deduplication schemes for clustered storage systems, highly efficient fingerprint lookup techniques, and intelligent data detection techniques.

Key words: Data deduplication, Deduplication ratio, System architecture, Metadata structure, I/O optimization
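
The fingerprint comparison described in the abstract can be illustrated with a minimal sketch. It is not the design of any system surveyed in the paper. Fixed-size 4 KiB blocks, SHA-1 as the fingerprint function, the in-memory dictionary index, and the names `deduplicate`, `restore`, and `dedup_ratio` are all assumptions made for illustration, and the deduplication ratio is assumed to mean logical bytes divided by stored bytes.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each unique block once.

    Assumption: fixed-size chunking with SHA-1 fingerprints. Real systems
    may use content-defined chunking or other hash functions.
    """
    store = {}   # fingerprint -> block contents (unique blocks only)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).hexdigest()
        if fingerprint not in store:
            # First time this fingerprint is seen: store the block.
            store[fingerprint] = block
        # Duplicate or not, the recipe records which block goes here.
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the block store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


def dedup_ratio(data: bytes, store) -> float:
    """Logical bytes divided by bytes actually stored (assumed definition)."""
    stored = sum(len(block) for block in store.values())
    return len(data) / stored if stored else 1.0


if __name__ == "__main__":
    # Three identical blocks of "A" followed by one block of "B".
    sample = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = deduplicate(sample)
    print(len(store), len(recipe), dedup_ratio(sample, store))
```

For the sample input, four logical blocks collapse to two stored blocks. The index lookup `fingerprint not in store` is the step that fast fingerprint lookup techniques aim to speed up once the index no longer fits in memory.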

