Wednesday, November 08, 2006

ZFS saved our data

Recently we migrated a Linux NFS server to a Solaris 10 NFS server with Sun Cluster 3.2 and ZFS. The system has two SCSI JBODs attached and each node has two SCSI adapters; ZFS was used to create RAID-10 across the JBODs and SCSI adapters. We used rsync to migrate the data. During the migration we noticed in the system logs that one of the SCSI adapters reported warnings from time to time. Then came more serious warnings about bad firmware or a broken adapter, but data kept being written.

When we ran rsync again, ZFS reported some checksum errors, but only on disks connected to the bad adapter. I ran a scrub on the entire pool, and ZFS reported and corrected thousands of checksum errors, all of them on the bad controller. We removed the bad controller, reconnected the JBOD to the good one, and ran a scrub again: this time there were no errors. Then we completed the data migration. So far everything works fine and ZFS reports no checksum errors.

The important thing here is that ZFS detected that the bad SCSI adapter was actually corrupting data, and ZFS was able to correct it on the fly, so we didn't have to start from the beginning. With a classic file system we probably wouldn't even have noticed that our data was corrupted until a system panic or until fsck was needed. And with so many errors, fsck probably wouldn't have restored file system consistency, not to mention that it wouldn't have corrected the bad data at all.
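To illustrate why the corruption could be repaired and not just detected, here is a toy Python model of a checksummed two-way mirror. It is only a sketch of the general idea, not how ZFS is actually implemented: the `ToyMirror` class, its method names, and the use of SHA-256 are all assumptions made for the example. The point it shows is that the expected checksum is kept apart from the data, so a copy damaged on one side of the mirror can be recognized and overwritten from the good side.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is chosen here only for simplicity (assumption for this sketch).
    return hashlib.sha256(data).digest()


class ToyMirror:
    """Hypothetical two-way mirror with checksums stored apart from the data."""

    def __init__(self):
        self.sides = [{}, {}]    # one dict per mirror side: block id -> bytes
        self.sums = {}           # block id -> expected checksum
        self.repaired = [0, 0]   # running count of repaired blocks per side

    def write(self, block_id, data: bytes):
        self.sums[block_id] = checksum(data)
        for side in self.sides:
            side[block_id] = data

    def read(self, block_id) -> bytes:
        expected = self.sums[block_id]
        good = None
        bad_sides = []
        for i, side in enumerate(self.sides):
            if checksum(side[block_id]) == expected:
                if good is None:
                    good = side[block_id]
            else:
                bad_sides.append(i)
        if good is None:
            raise IOError(f"block {block_id!r}: no copy matches its checksum")
        # Self-heal: overwrite every bad copy with the verified one.
        for i in bad_sides:
            self.sides[i][block_id] = good
            self.repaired[i] += 1
        return good

    def scrub(self):
        """Read every block once; return repairs made per side in this pass."""
        before = list(self.repaired)
        for block_id in self.sums:
            self.read(block_id)
        return [after - b for after, b in zip(self.repaired, before)]


mirror = ToyMirror()
for n in range(4):
    mirror.write(n, f"block {n}".encode())

# Simulate a bad adapter silently corrupting two blocks on side 1.
mirror.sides[1][0] = b"garbage"
mirror.sides[1][3] = b"garbage"

print("first scrub repairs per side:", mirror.scrub())
print("second scrub repairs per side:", mirror.scrub())
```

In this model all repairs land on the side that was corrupted, which mirrors what we saw: every checksum error was on disks behind the bad controller, and a second scrub came back clean.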

2 comments:

  1. Hi,

    I've got a few questions about Sun Cluster 3.2, I googled a lot but didn't find anything interesting... maybe you could help? :)

    1. SC 3.1 had a limit of two nodes on x86 machines, is it still so with 3.2?

    2. Sun's FAQ about SC 3.1 states that SC won't run on non-Sun x86 machines. I didn't have an opportunity to test it on a non-Sun system, did you? I'd very much like to set up a Sun Cluster on a custom x86 machine. Maybe 3.2 doesn't have this limitation?

    3. Does SC 3.2 have a Data Service for Apache HTTP server (not Tomcat) on x86?

    Thanks in advance!

  2. 1. I think I saw somewhere that the two-node restriction on x86 was lifted. However, I can't find that info right now.

    2. I've never actually tried it on x86, but I guess it will work on non-Sun x86/x64 hardware as long as Solaris works on it.

    3. Yes, there's an Apache data service.
