Tuesday, November 12, 2013

ZFS Appliance

While building a storage appliance based on ZFS, one of the important features is the ability to identify physical disk locations. This is hard to do for SAS disks and easier for SATA. Solaris 11 has a topology framework that makes this much easier, and it is nicely integrated with other subsystems such as FMA and ZFS. Recently I came across this blog entry, which highlights this specific issue of how to identify physical disk locations.

The other important factor is ease of use: to replace a failed disk drive one should be able to pull out the bad one, put in a replacement, and that's it. Everything else should happen automatically. There shouldn't be any need to log in to the OS and issue commands to assist with the replacement. Again, this is how things are in Solaris 11.
 
Let's see how it works in practice. Recently one disk reported two read errors and multiple checksum errors during a zpool scrub. Because the affected pool is redundant (RAID-10), ZFS was able to detect the corruption, serve good data from the other disk in the mirror, and fix the corrupted blocks on the affected disk.
The story doesn't end here though. FMA decided that too many checksum errors had been reported for a single disk, so it activated a hot spare as a precaution, in case the bad disk misbehaved again. This is what the pool looked like after the hot spare was fully attached:

# zpool status -v pool-0
  pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
        was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
  scan: resilvered 458G in 2h23m with 0 errors on Sat Nov  2 07:53:42 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        pool-0                       DEGRADED     0     0     0
          mirror-0                   ONLINE       0     0     0
            c0t5000CCA0165FC0F8d0    ONLINE       0     0     0
            c0t5000CCA016217040d0    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c0t5000CCA01666AB64d0    ONLINE       0     0     0
            c0t5000CCA0166F3BB8d0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c0t5000CCA0166F36C8d0    ONLINE       0     0     0
            c0t5000CCA01661894Cd0    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c0t5000CCA0166BE338d0    ONLINE       0     0     0
            c0t5000CCA016626340d0    ONLINE       0     0     0
          mirror-4                   ONLINE       0     0     0
            c0t5000CCA0166DC81Cd0    ONLINE       0     0     0
            c0t5000CCA016685238d0    ONLINE       0     0     0
          mirror-5                   ONLINE       0     0     0
            c0t5000CCA016636CA4d0    ONLINE       0     0     0
            c0t5000CCA016687528d0    ONLINE       0     0     0
          mirror-6                   ONLINE       0     0     0
            c0t5000CCA0166DC944d0    ONLINE       0     0     0
            c0t5000CCA0166DC0CCd0    ONLINE       0     0     0
          mirror-7                   ONLINE       0     0     0
            c0t5000CCA0166F4178d0    ONLINE       0     0     0
            c0t5000CCA01668DCC0d0    ONLINE       0     0     0
          mirror-8                   DEGRADED     0     0     0
            c0t5000CCA0166DD7BCd0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c0t5000CCA016671600d0  DEGRADED     2     0    49
              c0t5000CCA0166876F8d0  ONLINE       0     0     0
          mirror-9                   ONLINE       0     0     0
            c0t5000CCA0166DC20Cd0    ONLINE       0     0     0
            c0t5000CCA0166877BCd0    ONLINE       0     0     0
          mirror-10                  ONLINE       0     0     0
            c0t5000CCA0166F3334d0    ONLINE       0     0     0
            c0t5000CCA0166BDD2Cd0    ONLINE       0     0     0
        spares
          c0t5000CCA0166876F8d0      INUSE
          c0t5000CCA0166DCAACd0      AVAIL

device details:

        c0t5000CCA016671600d0      DEGRADED       too many errors
        status: FMA has degraded this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/DISK-8000-D5 for recovery


errors: No known data errors

Notice that the mirror-8 vdev is now a 3-way mirror. The affected disk is still functional, so it wasn't detached, but in case it goes really bad a hot spare has been attached alongside it. Below is what FMA reported:

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 02 05:29:39 d10c88f7-8e31-ce12-ab5c-8a759cf875c3  DISK-8000-D5   Major

Problem Status    : solved
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle-Corporation
    Name          : SUN-FIRE-X4270-M3
    Part_Number   : 31792382+1+1
    Serial_Number : XXXXXXXX
    Host_ID       : 004858f6

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.io.scsi.disk.csum-zfs.transient
   Certainty   : 100%
   Affects     : dev:///:devid=id1,sd@n5000cca016671600//scsi_vhci/disk@g5000cca016671600
   Status      : faulted but still providing degraded service

   FRU
     Location         : "HDD17"
     Manufacturer     : HITACHI
     Name             : H109090SESUN900G
     Part_Number      : HITACHI-H109090SESUN900G
     Revision         : A31A
     Serial_Number    : XXXXXXXX
     Chassis
        Manufacturer  : Oracle-Corporation
        Name          : SUN-FIRE-X4270-M3
        Part_Number   : 31792382+1+1
        Serial_Number : XXXXXXXX
        Status        : faulty

Description : There have been excessive transient ZFS checksum errors on this
              disk.

Response    : A hot-spare disk may have been activated.

Impact      : If a hot spare is available it will be brought online and during
              this time I/O could be impacted. If a hot spare isn't available
              then I/O could be lost and data corruption is possible.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/DISK-8000-D5 for the latest service
              procedures and policies regarding this diagnosis.
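
For monitoring scripts, the fault class and FRU location can be pulled out of a report like the one above with a few lines of parsing. This is only a rough sketch: the function name `fru_locations` is made up for illustration, the sample text is an excerpt of the output shown above, and it assumes the `Fault class` and `Location` lines keep this exact layout.

```python
import re

# Excerpt of the 'fmadm faulty' output shown above.
SAMPLE = """\
Suspect 1 of 1 :
   Fault class : fault.io.scsi.disk.csum-zfs.transient
   Certainty   : 100%
   Status      : faulted but still providing degraded service

   FRU
     Location         : "HDD17"
     Manufacturer     : HITACHI
     Name             : H109090SESUN900G
"""

def fru_locations(report):
    """Return (fault_class, location) pairs found in an fmadm-style report.

    Hypothetical helper: it pairs each quoted Location with the most
    recent 'Fault class' line seen before it.
    """
    pairs = []
    fault_class = None
    for line in report.splitlines():
        m = re.match(r'\s*Fault class\s*:\s*(\S+)', line)
        if m:
            fault_class = m.group(1)
            continue
        m = re.match(r'\s*Location\s*:\s*"([^"]+)"', line)
        if m:
            pairs.append((fault_class, m.group(1)))
    return pairs

print(fru_locations(SAMPLE))
# [('fault.io.scsi.disk.csum-zfs.transient', 'HDD17')]
```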

Notice that FMA reports the affected disk location as HDD17, which corresponds to the HDD17 slot on the X3-2L server. That way we know exactly which disk to replace. We can also get physical disk locations from the zpool status command:

# zpool status -l pool-0
  pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
        was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
        Run 'zpool status -v' to see device specific details.
  scan: resilvered 458G in 2h23m with 0 errors on Sat Nov  2 07:53:42 2013
config:

        NAME                               STATE     READ WRITE CKSUM
        pool-0                             DEGRADED     0     0     0
          mirror-0                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD00/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD01/disk    ONLINE       0     0     0
          mirror-1                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD02/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD03/disk    ONLINE       0     0     0
          mirror-2                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD04/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD05/disk    ONLINE       0     0     0
          mirror-3                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD06/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD07/disk    ONLINE       0     0     0
          mirror-4                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD08/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD09/disk    ONLINE       0     0     0
          mirror-5                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD10/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD11/disk    ONLINE       0     0     0
          mirror-6                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD12/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD13/disk    ONLINE       0     0     0
          mirror-7                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD14/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD15/disk    ONLINE       0     0     0
          mirror-8                         DEGRADED     0     0     0
            /dev/chassis/SYS/HDD16/disk    ONLINE       0     0     0
            spare-1                        DEGRADED     0     0     0       
              /dev/chassis/SYS/HDD17/disk  DEGRADED     2     0    49
              /dev/chassis/SYS/HDD22/disk  ONLINE       0     0     0
          mirror-9                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD18/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD19/disk    ONLINE       0     0     0
          mirror-10                        ONLINE       0     0     0
            /dev/chassis/SYS/HDD20/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD21/disk    ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk      INUSE
          /dev/chassis/SYS/HDD23/disk      AVAIL

errors: No known data errors
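
The same listing is easy to post-process if you want an alert that names the bay directly. The sketch below is an illustration rather than a finished tool: `suspect_disks` is a hypothetical helper, the sample is a shortened copy of the config section above, and it assumes leaf devices appear as `/dev/chassis/.../disk` paths with plain integer error counters.

```python
import re

# Shortened copy of the config section from 'zpool status -l' above.
SAMPLE = """\
        NAME                               STATE     READ WRITE CKSUM
        pool-0                             DEGRADED     0     0     0
          mirror-8                         DEGRADED     0     0     0
            /dev/chassis/SYS/HDD16/disk    ONLINE       0     0     0
            spare-1                        DEGRADED     0     0     0
              /dev/chassis/SYS/HDD17/disk  DEGRADED     2     0    49
              /dev/chassis/SYS/HDD22/disk  ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk      INUSE
          /dev/chassis/SYS/HDD23/disk      AVAIL
"""

# name, state and three numeric counters; the header and spares rows do not match
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def suspect_disks(status_text):
    """Return (bay, state, read, write, cksum) for disks that look unhealthy."""
    suspects = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        name, state = m.group(1), m.group(2)
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if not name.startswith("/dev/chassis/"):
            continue  # pool, mirror-N and spare-N rows are not physical disks
        if state != "ONLINE" or read or write or cksum:
            bay = name.split("/")[-2]  # /dev/chassis/SYS/HDD17/disk -> HDD17
            suspects.append((bay, state, read, write, cksum))
    return suspects

print(suspect_disks(SAMPLE))
# [('HDD17', 'DEGRADED', 2, 0, 49)]
```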

Now it is up to us: either run a scrub, wait a few days and, if there are no new errors, clear the pool status and deactivate the hot spare; or take no chances and replace the affected disk drive. We decided to replace it. The disk in bay 17 of the X3-2L was physically pulled out and a replacement was put in its place. Since the autoreplace property is enabled on the pool, FMA/ZFS automatically put an EFI label on the new disk and attached it to the pool. Once it had fully synchronized, the hot spare was detached and made available again. We didn't have to log in to the OS or coordinate in any way with the physical disk replacement. Here is what zpool history looked like:

# zpool history -i pool-0
…
2013-11-06.20:38:03 [internal pool scrub txg:1709498] func=2 mintxg=3 maxtxg=1709499 logs=0
2013-11-06.20:38:17 [internal vdev attach txg:1709501] replace vdev=/dev/dsk/c0t5000CCA016217834d0s0 \
                                                           for vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:33 [internal pool scrub done txg:1710852] complete=1 logs=0
2013-11-06.23:01:34 [internal vdev detach txg:1710854] vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:39 [internal vdev detach txg:1710855] vdev=/dev/dsk/c0t5000CCA0166876F8d0s0
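
As a quick sanity check, the time between the attach record and the "scrub done" record can be computed from these timestamps. The sketch below makes a few assumptions: `resilver_duration` is a hypothetical helper, the attach record is joined onto one line (the continuation backslash above is only for display), and treating "pool scrub done" as the end of the resilver is my reading of the log, based on its timestamp matching the scan line in zpool status.

```python
from datetime import datetime

# Records copied from the 'zpool history -i' output above,
# with the wrapped attach record joined onto a single line.
SAMPLE = """\
2013-11-06.20:38:03 [internal pool scrub txg:1709498] func=2 mintxg=3 maxtxg=1709499 logs=0
2013-11-06.20:38:17 [internal vdev attach txg:1709501] replace vdev=/dev/dsk/c0t5000CCA016217834d0s0 for vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:33 [internal pool scrub done txg:1710852] complete=1 logs=0
2013-11-06.23:01:34 [internal vdev detach txg:1710854] vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:39 [internal vdev detach txg:1710855] vdev=/dev/dsk/c0t5000CCA0166876F8d0s0
"""

FMT = "%Y-%m-%d.%H:%M:%S"

def resilver_duration(history):
    """Time from the first 'vdev attach' to the next 'pool scrub done' record."""
    start = end = None
    for line in history.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        stamp, rest = parts
        if start is None and rest.startswith("[internal vdev attach"):
            start = datetime.strptime(stamp, FMT)
        elif start is not None and rest.startswith("[internal pool scrub done"):
            end = datetime.strptime(stamp, FMT)
            break
    if start is None or end is None:
        return None
    return end - start

print(resilver_duration(SAMPLE))
# 2:23:16
```

The result lines up with the "2h23m" that zpool status reports for the resilver.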

Let's see the pool status after the replacement disk fully synchronized:

# zpool status -l pool-0
  pool: pool-0
state: ONLINE
  scan: resilvered 459G in 2h23m with 0 errors on Wed Nov  6 23:01:34 2013
config:

        NAME                             STATE     READ WRITE CKSUM
        pool-0                           ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD00/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD01/disk  ONLINE       0     0     0
          mirror-1                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD02/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD03/disk  ONLINE       0     0     0
          mirror-2                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD04/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD05/disk  ONLINE       0     0     0
          mirror-3                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD06/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD07/disk  ONLINE       0     0     0
          mirror-4                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD08/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD09/disk  ONLINE       0     0     0
          mirror-5                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD10/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD11/disk  ONLINE       0     0     0
          mirror-6                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD12/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD13/disk  ONLINE       0     0     0
          mirror-7                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD14/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD15/disk  ONLINE       0     0     0
          mirror-8                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD16/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD17/disk  ONLINE       0     0     0
          mirror-9                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD18/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD19/disk  ONLINE       0     0     0
          mirror-10                      ONLINE       0     0     0
            /dev/chassis/SYS/HDD20/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD21/disk  ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk    AVAIL
          /dev/chassis/SYS/HDD23/disk    AVAIL

errors: No known data errors

All is back to normal. We can also check if all disks, including the new one, are of the same part number, firmware level, etc.

# diskinfo -t disk -o Rcmenf1
R:receptacle-name c:occupant-compdev    m:occupant-mfg e:occupant-model n:occupant-part          f:occupant-firm 1:occupant-capacity
----------------- --------------------- -------------- ---------------- ------------------------ --------------- -------------------
SYS/HDD00         c0t5000CCA0165FC0F8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD01         c0t5000CCA016217040d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD02         c0t5000CCA01666AB64d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD03         c0t5000CCA0166F3BB8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD04         c0t5000CCA0166F36C8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD05         c0t5000CCA01661894Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD06         c0t5000CCA0166BE338d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD07         c0t5000CCA016626340d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD08         c0t5000CCA0166DC81Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD09         c0t5000CCA016685238d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD10         c0t5000CCA016636CA4d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD11         c0t5000CCA016687528d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD12         c0t5000CCA0166DC944d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD13         c0t5000CCA0166DC0CCd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD14         c0t5000CCA0166F4178d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD15         c0t5000CCA01668DCC0d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD16         c0t5000CCA0166DD7BCd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD17         c0t5000CCA016217834d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD18         c0t5000CCA0166DC20Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD19         c0t5000CCA0166877BCd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD20         c0t5000CCA0166F3334d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD21         c0t5000CCA0166BDD2Cd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD22         c0t5000CCA0166876F8d0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
SYS/HDD23         c0t5000CCA0166DCAACd0 HITACHI        H109090SESUN900G HITACHI-H109090SESUN900G A31A            900185481216
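
With 24 disks it is easy to miss a single odd one by eye, so the comparison can be automated. The following is a minimal sketch under some assumptions: `odd_disks` is a hypothetical helper, it expects the seven whitespace-separated columns produced by the -o Rcmenf1 format above, and it assumes receptacle names start with "SYS/" as they do on this server.

```python
from collections import Counter

# Three rows taken from the diskinfo output above.
SAMPLE = """\
SYS/HDD16 c0t5000CCA0166DD7BCd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD17 c0t5000CCA016217834d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD18 c0t5000CCA0166DC20Cd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
"""

def odd_disks(listing):
    """Return (bay, signature) for disks that differ from the most common one.

    The signature is (mfg, model, part, firmware, capacity); the device
    name is ignored because it is unique per disk.
    """
    parsed = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].startswith("SYS/"):
            continue  # skip the header, the dashed separator and blank lines
        bay, _dev, mfg, model, part, firm, capacity = fields
        parsed.append((bay, (mfg, model, part, firm, capacity)))
    if not parsed:
        return []
    common, _count = Counter(sig for _bay, sig in parsed).most_common(1)[0]
    return [(bay, sig) for bay, sig in parsed if sig != common]

print(odd_disks(SAMPLE))
# []
```

An empty list means every disk in the listing shares the same manufacturer, model, part number, firmware and capacity.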

This is a very cool integration of different features in Solaris 11, which makes a ZFS-based solution much more reliable and easier to support.

The topology framework should work out of the box on Oracle servers and Solaris 11. The above example is from Solaris 11.1 + SRU10 running on an X3-2L server with 24 disks in front (and another 2x in the rear for the OS, mirrored by the controller itself). It has a simple HBA which presents all of the front disks as JBOD, which is perfect for ZFS.

The topology framework also works on 3rd party hardware, but depending on the particular set-up some additional configuration steps might be required, like defining Bay Labels. If the disks are behind a SAS expander then it is more complicated to get it working, and I'm not sure if there is a documented procedure describing how to do it.

Friday, November 01, 2013

SYNCHRONIZE_CACHE on close()

Recently, while testing iSCSI, I noticed that when you close a raw device which is an iSCSI target, Solaris sends a SCSI SYNCHRONIZE CACHE command on close():

$ dd if=/dev/zero of=/dev/rdsk/c0t6537643965643539d0s0 bs=1b count=1
1+0 records in
1+0 records out


$ dtrace -n fbt::*SYNCHRONIZE_CACHE:entry'{printf("%s %d\n", execname, pid);stack();}'
dtrace: description 'fbt::*SYNCHRONIZE_CACHE:entry' matched 1 probe

dd 2562

              sd`sdclose+0x1c0
              genunix`dev_close+0x55
              specfs`device_close+0xb3
              specfs`spec_close+0x171
              genunix`fop_close+0x9f
              genunix`closef+0x68
              genunix`closeandsetf+0x5be
              genunix`close+0x18
              unix`_sys_sysenter_post_swapgs+0x149