Monday, March 31, 2008

ZFS De-Duplication

UPDATE: ZFS dedup finally integrated!

With the integration of this RFE we are (hopefully) closer to built-in ZFS de-duplication. Read more on Eric's blog.


Once the ZFS re-writer (block pointer rewrite) and de-duplication are done, in theory one should be able to do a zpool upgrade of a current pool and de-dup all the data which is already there... we will see :)

Eric mentioned on his blog that in reality we should use sha256 or stronger. I would go even further - two modes: one where you depend entirely on the block checksum, and another where you additionally compare a given block byte-by-byte to be 100% sure the blocks are the same. Slower for some workloads, but safe.
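
A minimal sketch of the idea in Python - a toy, not ZFS code, with hashlib's sha256 standing in for the pool's block checksum and an in-memory dict for the dedup table:

    import hashlib

    class BlockStore:
        """Toy in-memory dedup table: checksum -> stored block."""

        def __init__(self, verify=False):
            self.verify = verify   # True = byte-compare on checksum match
            self.blocks = {}       # sha256 hex digest -> block bytes

        def write(self, block):
            digest = hashlib.sha256(block).hexdigest()
            existing = self.blocks.get(digest)
            if existing is not None:
                if not self.verify:
                    return digest          # mode 1: trust the checksum
                if existing == block:
                    return digest          # mode 2: bytes really match
                # A genuine sha256 collision; a real implementation
                # would need a fallback here instead of giving up.
                raise RuntimeError("checksum collision detected")
            self.blocks[digest] = block    # first copy: actually store it
            return digest

With verify=False a duplicate write is answered from the checksum alone; with verify=True the stored bytes are compared first. (The dedup that was eventually integrated exposes essentially this choice through the dedup=on vs. dedup=verify property settings.)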

Now, de-dup which "understands" your data would be even better (analyzing file contents - for example parsing emails and de-duping at the attachment level, etc.); nevertheless, a block-level one would be a good start.
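
To illustrate what "understanding" the data could mean, here is a toy Python sketch (the function name and the in-memory store are made up for the example) that parses an e-mail and hashes each attachment separately, so identical attachments dedup across otherwise different messages:

    import email
    import hashlib

    def dedup_attachments(raw_message, store):
        """Hash each attachment of one e-mail into a shared dedup store.

        store: dict mapping sha256 hex digest -> attachment bytes.
        """
        msg = email.message_from_bytes(raw_message)
        refs = []
        for part in msg.walk():
            if part.get_filename() is None:    # skip non-attachment parts
                continue
            payload = part.get_payload(decode=True)
            if not payload:
                continue
            digest = hashlib.sha256(payload).hexdigest()
            store.setdefault(digest, payload)  # keep only the first copy
            refs.append((part.get_filename(), digest))
        return refs                            # message keeps references only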

6 comments:

Moinak said...

Full byte-for-byte comparisons all the time would be slow, but this can be avoided: compare only checksums, and if the checksums match then do a byte-for-byte comparison to be sure. Safe, and less performance overhead.

Anonymous said...

Furthermore, we as a community don't absolutely need "all the time" deduplication. Scheduled, limited-time deduplication, like "run for an hour at 1AM", would be both tractable and useful in many environments.
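
Such a pass is easy to sketch, too. This toy Python (file-level and hardlink-based, nothing like a real block-level implementation; the function name is made up) hashes files under a directory and hardlinks duplicates until its time budget runs out, suitable for running from cron at 1AM:

    import hashlib
    import os
    import time

    def dedup_pass(root, budget_seconds=3600):
        """Hardlink duplicate files under root until the budget expires."""
        deadline = time.monotonic() + budget_seconds
        first_seen = {}                        # sha256 hex digest -> path
        for dirpath, _, names in os.walk(root):
            for name in names:
                if time.monotonic() >= deadline:
                    return                     # out of time; resume tomorrow
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                with open(path, "rb") as f:    # toy: reads whole file at once
                    digest = hashlib.sha256(f.read()).hexdigest()
                original = first_seen.setdefault(digest, path)
                if original != path and not os.path.samefile(original, path):
                    tmp = path + ".dedup-tmp"
                    os.link(original, tmp)     # hardlink to the first copy
                    os.replace(tmp, path)      # atomically swap it in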

Anonymous said...

"we as a community don't absolutely need "all the time" deduplication." -Anon.

Which community are you part of? Home users?

To gain acceptance in mid-range through enterprise-level environments, dedupe needs to happen all the time, as part of the file system.

Anonymous said...

"Which community are you part of? Home users?"

I would guess the NetApp one that's looking to save costs maybe? I know I am.

James said...

Jeff and Bill will be presenting a keynote speech about ZFS deduplication at Kernel Conference Australia, in Brisbane in July 2009.

Anonymous said...

Scheduled??? Who would want their de-dup done at a specific time? In addition, that would require more storage to hold the original data and the de-duped data as well. Inline de-duping is the only way to go!