Robert Milkowski's blog: ZFS De-Duplication

Monday, March 31, 2008

ZFS De-Duplication

UPDATE: ZFS dedup finally integrated!

With the integration of this RFE we are (hopefully) closer to built-in ZFS de-duplication. Read more on Eric's blog.

Once the ZFS re-writer and de-duplication are done, in theory one should be able to do a zpool upgrade of an existing pool and de-dup all the data that is already there... we will see :)

Eric mentioned on his blog that in reality we should use sha256 or stronger. I would go even further and offer two modes: one where you depend entirely on the block checksum, and another where, once the checksums match, you also compare the two blocks byte-by-byte to be 100% sure they are the same. The second is slower for some workloads but safe.
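To make the two modes concrete, here is a minimal sketch in Python of a toy in-memory block store. It is only an illustration of the idea, not how ZFS implements de-duplication. The class `DedupStore` and its `verify` and `hash_fn` parameters are hypothetical names invented for this example. `hash_fn` exists only so that a deliberately weak checksum can be plugged in to force a collision.

```python
import hashlib


class DedupStore:
    """Toy block store illustrating checksum-only vs. verified de-dup.

    Hypothetical example code; not related to the real ZFS implementation.
    """

    def __init__(self, verify=False, hash_fn=None):
        self.verify = verify
        # Default checksum is SHA-256; hash_fn is an assumption added so a
        # weak checksum can be injected for demonstration.
        self.hash_fn = hash_fn or (lambda blk: hashlib.sha256(blk).digest())
        self.blocks = {}    # (checksum, slot) -> stored block bytes
        self.refcount = {}  # (checksum, slot) -> number of references

    def write(self, block):
        """Store a block and return its key, sharing an existing copy if possible."""
        checksum = self.hash_fn(block)
        key = (checksum, 0)
        while key in self.blocks:
            # Checksum-only mode trusts the match; verify mode also
            # compares the stored bytes with the incoming block.
            if not self.verify or self.blocks[key] == block:
                self.refcount[key] += 1
                return key
            # Same checksum, different contents: try the next slot.
            key = (checksum, key[1] + 1)
        self.blocks[key] = block
        self.refcount[key] = 1
        return key
```

With a strong checksum both modes behave the same for identical blocks. The difference shows only on a collision: checksum-only mode would silently hand back the wrong block, while verify mode pays for an extra comparison and stores the second block separately.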

Now, de-dup which "understands" your data would be even better (for example, analyzing file contents such as emails and de-duplicating at the attachment level). Nevertheless, block-level de-dup would be a good start.

6 comments:

Moinak said...: Full byte-for-byte comparisons all the time will be slow, but this can be avoided. Compare only checksums, and if the checksums match then do a byte-for-byte comparison to be sure. Safe and less performance overhead.; 16 April, 2008 11:33
Anonymous said...: Furthermore, we as a community don't absolutely need "all the time" deduplication. Scheduled, limited-time deduplication, like "run for an hour at 1AM" would be both tractable and useful in many environments.; 27 June, 2008 17:34
Anonymous said...: "we as a community don't absolutely need "all the time" deduplication." -Anon.

Which community are you part of? Home users?

To gain acceptance at the mid-range through enterprise level, doesn't dedupe need to happen all the time as part of the file system?; 02 October, 2008 20:21
Anonymous said...: "Which community are you part of? Home users?"

I would guess the NetApp one that's looking to save costs maybe? I know I am.; 12 March, 2009 12:30
James said...: Jeff and Bill will be presenting a keynote speech about ZFS deduplication at Kernel Conference Australia, in Brisbane in July 2009.; 18 May, 2009 11:02
Anonymous said...: Scheduled??? Who would want their de-dup done at a specific time? In addition, that would require more storage to hold both the original data and the de-duped data as well. Inline de-duping is the only way to go!; 26 August, 2009 18:28