Survey on Data Deduplication Techniques for Storage Systems

Abstract

As enterprise data volumes keep growing, the demand for massive storage capacity has become a major challenge for data centers, and research shows that about 60% of the data in storage systems is redundant. High redundancy in storage systems has therefore drawn increasing attention from researchers. Data deduplication spends CPU resources on comparing the fingerprints of data blocks, where each fingerprint uniquely identifies its block, and thereby reduces the data held in storage systems efficiently; it has become a hot topic in both industry and academia. Based on an analysis and summary of the literature on data deduplication published over the past ten years, this paper first analyzes the architecture of volume-level data deduplication systems and then presents the principles, implementation mechanisms, and evaluation methodologies of representative data deduplication systems. Second, it reviews existing deduplication optimization techniques with respect to both the characteristics of the data and the scale of the deduplication system. Finally, by comparing the application scenarios of data deduplication systems, it identifies four research directions: primary-storage-level data deduplication approaches, distributed data deduplication schemes for clustered storage systems, highly efficient fingerprint lookup techniques, and intelligent data detection techniques.

Key words: Data deduplication, Deduplication ratio, System architecture, Metadata structure, I/O optimization

XIE Ping. Survey on Data Deduplication Techniques for Storage Systems[J]. Computer Science, 2014, 41(1): 22-30.
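
The fingerprint comparison described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the fixed 4096-byte block size, the choice of SHA-256 as the fingerprint function, and the names `deduplicate`, `restore`, and `dedup_ratio` are all assumptions made for illustration. The sketch also treats equal fingerprints as equal blocks, which holds only up to the negligible collision probability of the hash.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each unique block once.

    Block size and SHA-256 are illustrative assumptions, not the paper's choices.
    """
    store = {}   # fingerprint -> block contents (unique blocks only)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            # First time this content is seen: store the block itself.
            store[fingerprint] = block
        # Duplicate or not, the recipe only records the fingerprint.
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the unique blocks and the recipe."""
    return b"".join(store[fp] for fp in recipe)


def dedup_ratio(data: bytes, store) -> float:
    """Bytes before deduplication divided by bytes actually stored."""
    stored = sum(len(block) for block in store.values())
    return len(data) / stored if stored else 1.0


# Three identical blocks followed by one different block.
sample = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = deduplicate(sample)
print(len(recipe), len(store), dedup_ratio(sample, store))
```

With this input the recipe holds four fingerprints while only two blocks are stored, so the deduplication ratio (one of the paper's keywords) is the original size divided by the stored size. Real systems also have to persist the fingerprint index, which this sketch keeps in an in-memory dictionary.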

