Monday, March 19, 2007

ZFS online replication

During last Christmas I was playing with ZFS code again and I figured out that adding online replication of ZFS file systems should be quite easy to implement. By online replication I mean one-to-one relation between two file systems, potentially on different servers, and all modifications done to one file system are asynchronously replicated to the other one with a small delay (like few seconds). Additionally one should be able to snapshot remote file system independently to get point-in-time copies and resume replication from automatically created snapshots on both ends at given intervals. The good thing is that once you're just few seconds behind you should get all transactions from memory so you get a remote copy of your file system without generating any additional IOs on a backuped one.

Due to some reasons I haven't done it myself rather I asked one of my developers to actually implement such tool and here we are :)

bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
solaris 5.13G 11.6G 24.5K /solaris
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00# zfs create solaris/d100
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
solaris 5.13G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00#


Now in another terminal:


bash-3.00# ./zreplicate send solaris/d100 | ./zreplicate receive solaris/d100-copy


Back to original terminal:


bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
solaris 5.13G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/d100-copy 24.5K 11.6G 24.5K /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00#
bash-3.00# cp /platform/i86pc/boot_archive /solaris/d100/
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
solaris 5.15G 11.6G 26.5K /solaris
solaris/d100 12.0M 11.6G 12.0M /solaris/d100
solaris/d100-copy 12.0M 11.6G 12.0M /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00#
bash-3.00# rm /solaris/d100/boot_archive
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
solaris 5.14G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/d100-copy 12.0M 11.6G 12.0M /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
solaris 5.13G 11.6G 26.5K /solaris
solaris/d100 24.5K 11.6G 24.5K /solaris/d100
solaris/d100-copy 24.5K 11.6G 24.5K /solaris/d100-copy
solaris/testws 5.13G 11.6G 5.13G /export/testws/
bash-3.00#

bash-3.00# cp /platform/i86pc/boot_archive /solaris/d100/
[stop replication in another terminal]
bash-3.00# zfs mount -a
bash-3.00# digest -a md5 /solaris/d100/boot_archive
33e242158c6eb691d23ce2c522a7bf55
bash-3.00# digest -a md5 /solaris/d100-copy/boot_archive
33e242158c6eb691d23ce2c522a7bf55
bash-3.00#


Bingo! All modifications to solaris/d100 are automatically replicated to solaris/d100-copy. Of course you can replicate over the network to remote server using ssh.

There're still some minor problems but generally the tool works as expected.

Once the first phase is implemented we will probably start second one - to implement a tool to manage replications between servers (like automatic replication setup if new file system is created, replication resume in case of a problem, etc.).


There're other approaches which create a snapshots and then incrementally replicate them to remote side in given intervals. While our approach is very similar it's more elegant and gives you almost on-line replication. What do you think?

12 comments:

Mads said...

Sounds very, very handy.
Especially for redundancy with a passive node.
I would love to see that sort of functionality in zfs. Any chance of getting to try it?

Sam Freiberg said...

Ever since beginning the switch from Linux to Solaris I've been hoping for this very thing. Clustering is just too much of a pain when all you really want is file system replication. Are you going to be making this available?

milek said...

We're going to try to intyegrate it into Nevada. We should release code changes really soon now - watch zfs-discuss@ list.

Ed said...

Nice and simple!
For this hamstrung windows admin, it might be a good first-port-of-call before trying AVS 4.0
Keep up the good work!

James Mansion said...

Does it note which files and directories change and copy them, or does it copy changed blocks?

And is there any sense of ordering applied so that it won't destroy (say) a PostgreSQL or SQLite file if the copy is interrupted?

milek said...

It's an enhacement to ZFS in-kernel itself so all transactions are replicated in the same order they were applied to original file system.
More details soon.

af said...

Excellent stuff. Could this be used to replicate a zoned filesystem from the global zone?

milek said...

Sure, I see no reason why not.

asa said...

Has there been any movement on this issue? Any word on when it might make it into nevada?

Rudd-O said...

Come on guys, when will we have this? Will this make it into opensolaris? For what I do, AVS is unworkable due to the complexity of our work (thousands and thousands of volumes and disks), but this is LITERALLY IDEAL.

Rudd-O said...

BTW, will this work with ZVOLs as well?

milek said...

Rudd-O: we hit some issues back then and due to lack of time we abandon the approach. However you can find many scripts providing replication in zfs by utilizint snapshots and zfs send|recv which is mostly equivalent to what I was describing.