Saturday, December 13, 2014

ZFS: RAID-Z Resilvering

Solaris 11.2 introduced a new ZFS pool version: 35 (sequential resilver).

The new feature is supposed to make disk resilvering (disk replacement, hot-spare synchronization, etc.) much faster. It achieves this by prefetching some metadata first and then reading the data to be resilvered in a largely sequential manner. And it does work!

Here is a real-world case with real data: over 150 million files of varying sizes, most relatively small. Many of them have been deleted and new ones written, etc., so I expect the data in the pool is already fragmented. The server is a Sun/Oracle X4-2L with 26x 1.2TB 2.5" 10k SAS disks. The 24 disks in front are presented in pass-through mode and managed by ZFS, configured as 3 RAID-Z pools; the other 2 disks in the rear are configured as RAID-1 in the RAID controller and used for the OS. A disk in one of the pools failed, and a hot spare automatically attached:

# zpool status -x
  pool: XXXXXXXXXXXXXXXXXXXXXXX-0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri Dec 12 21:02:58 2014
    3.60T scanned
    45.9G resilvered at 342M/s, 9.96% done, 2h45m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXXXXX-0    DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            spare-0                  DEGRADED     0     0     0
              c0t5000CCA01D5EAE50d0  UNAVAIL      0     0     0
              c0t5000CCA01D5EED34d0  DEGRADED     0     0     0  (resilvering)
            c0t5000CCA01D5BF56Cd0    ONLINE       0     0     0
            c0t5000CCA01D5E91B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F9B00d0    ONLINE       0     0     0
            c0t5000CCA01D5E87E4d0    ONLINE       0     0     0
            c0t5000CCA01D5E95B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F8244d0    ONLINE       0     0     0
            c0t5000CCA01D58B3A4d0    ONLINE       0     0     0
        spares
          c0t5000CCA01D5EED34d0      INUSE
          c0t5000CCA01D5E1E3Cd0      AVAIL

errors: No known data errors
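If you want to script around this, the progress figures can be pulled out of the scan line. A minimal sketch (plain shell and awk, not any official interface; the sample line is copied from the output above):

```shell
# Extract the resilver progress figures from a captured 'zpool status'
# scan line.  Field positions after stripping ',' and '%':
#   $1 = amount resilvered, $4 = rate, $5 = percent done, $7 = ETA
scan_line='45.9G resilvered at 342M/s, 9.96% done, 2h45m to go'

progress=$(echo "$scan_line" | awk '{
    gsub(/[,%]/, "")
    printf "resilvered=%s rate=%s done=%s eta=%s", $1, $4, $5, $7
}')
echo "$progress"
# prints: resilvered=45.9G rate=342M/s done=9.96 eta=2h45m
```

Note the exact wording of the scan line varies between Solaris releases, so a parser like this is fragile by nature.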

Let's look at the I/O statistics for the disks involved:

# iostat -xnC 1 | egrep "device| c0$|c0t5000CCA01D5EAE50d0|c0t5000CCA01D5EED34d0..."
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 16651.6  503.9 478461.6 69423.4  0.2 26.3    0.0    1.5   1  19 c0
 2608.5    0.0 70280.3    0.0  0.0  1.6    0.0    0.6   3  36 c0t5000CCA01D5E95B0d0
 2582.5    0.0 66708.5    0.0  0.0  1.9    0.0    0.7   3  39 c0t5000CCA01D5F9B00d0
 2272.6    0.0 68571.0    0.0  0.0  2.9    0.0    1.3   2  50 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  503.9    0.0 69423.8  0.0  9.7    0.0   19.3   2 100 c0t5000CCA01D5EED34d0
 2503.5    0.0 66508.4    0.0  0.0  2.0    0.0    0.8   3  41 c0t5000CCA01D58B3A4d0
 2324.5    0.0 67093.8    0.0  0.0  2.1    0.0    0.9   3  44 c0t5000CCA01D5F8244d0
 2285.5    0.0 69192.3    0.0  0.0  2.3    0.0    1.0   2  45 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 1997.6    0.0 70006.0    0.0  0.0  3.3    0.0    1.6   2  54 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 25150.8  624.9 499295.4 73559.8  0.2 33.7    0.0    1.3   1  22 c0
 3436.4    0.0 68455.3    0.0  0.0  3.3    0.0    0.9   2  51 c0t5000CCA01D5E95B0d0
 3477.4    0.0 71893.7    0.0  0.0  3.0    0.0    0.9   3  48 c0t5000CCA01D5F9B00d0
 3784.4    0.0 72370.6    0.0  0.0  3.6    0.0    0.9   3  56 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  624.9    0.0 73559.8  0.0  9.4    0.0   15.1   2 100 c0t5000CCA01D5EED34d0
 3170.5    0.0 72167.9    0.0  0.0  3.5    0.0    1.1   2  55 c0t5000CCA01D58B3A4d0
 3881.4    0.0 72870.8    0.0  0.0  3.3    0.0    0.8   3  55 c0t5000CCA01D5F8244d0
 4252.3    0.0 70709.1    0.0  0.0  3.2    0.0    0.8   3  53 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 3063.5    0.0 70380.1    0.0  0.0  4.0    0.0    1.3   2  60 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 17190.2  523.6 502346.2 56121.6  0.2 31.0    0.0    1.8   1  18 c0
 2342.7    0.0 71913.8    0.0  0.0  2.9    0.0    1.2   3  43 c0t5000CCA01D5E95B0d0
 2306.7    0.0 72312.9    0.0  0.0  3.0    0.0    1.3   3  43 c0t5000CCA01D5F9B00d0
 2642.1    0.0 68822.9    0.0  0.0  2.9    0.0    1.1   3  45 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  523.6    0.0 56121.2  0.0  9.3    0.0   17.8   1 100 c0t5000CCA01D5EED34d0
 2257.7    0.0 71946.9    0.0  0.0  3.2    0.0    1.4   2  44 c0t5000CCA01D58B3A4d0
 2668.2    0.0 72685.4    0.0  0.0  2.9    0.0    1.1   3  43 c0t5000CCA01D5F8244d0
 2236.6    0.0 71829.5    0.0  0.0  3.3    0.0    1.5   3  47 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 2695.2    0.0 72395.4    0.0  0.0  3.2    0.0    1.2   3  45 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 31265.3  578.9 342935.3 53825.1  0.2 18.3    0.0    0.6   1  15 c0
 3748.0    0.0 48255.8    0.0  0.0  1.5    0.0    0.4   2  42 c0t5000CCA01D5E95B0d0
 4367.0    0.0 47278.2    0.0  0.0  1.1    0.0    0.3   2  35 c0t5000CCA01D5F9B00d0
 4706.1    0.0 50982.6    0.0  0.0  1.3    0.0    0.3   3  37 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  578.9    0.0 53824.8  0.0  9.7    0.0   16.8   1 100 c0t5000CCA01D5EED34d0
 4094.1    0.0 48077.3    0.0  0.0  1.2    0.0    0.3   2  35 c0t5000CCA01D58B3A4d0
 5030.1    0.0 47700.1    0.0  0.0  0.9    0.0    0.2   3  33 c0t5000CCA01D5F8244d0
 4939.9    0.0 52671.2    0.0  0.0  1.1    0.0    0.2   3  33 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 4380.1    0.0 47969.9    0.0  0.0  1.4    0.0    0.3   3  36 c0t5000CCA01D5BF56Cd0
^C
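A quick back-of-the-envelope check on these numbers: dividing kr/s by r/s gives the average read size per disk. This is just shell arithmetic on one sample line from the capture above, not anything the tools report directly:

```shell
# Average read size for one data disk from the first iostat sample:
# 70280.3 kr/s at 2608.5 r/s.
avg_kb=$(awk 'BEGIN { printf "%.1f", 70280.3 / 2608.5 }')
echo "average read size: ${avg_kb} KB"
# prints: average read size: 26.9 KB
```

Roughly 27 KB per read at ~2,600 reads/s per spindle (about 70 MB/s) is only sustainable on a 10k SAS disk if the reads are largely sequential, which is exactly what the sequential resilver promises.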

These are pretty amazing numbers for RAID-Z - the only way a single disk drive can do so many thousands of reads per second is if most of them are almost perfectly sequential. From time to time I see even more impressive numbers:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 73503.1 3874.0 53807.0 19166.6  0.3  9.8    0.0    0.1   1  16 c0
 9534.8    0.0 6859.5    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5E95B0d0
 9475.7    0.0 6969.1    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5F9B00d0
 9646.9    0.0 7176.4    0.0  0.0  0.4    0.0    0.0   3  31 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0 3478.6    0.0 18040.0  0.0  5.1    0.0    1.5   2  98 c0t5000CCA01D5EED34d0
 8213.4    0.0 6908.0    0.0  0.0  0.8    0.0    0.1   3  38 c0t5000CCA01D58B3A4d0
 9671.9    0.0 6860.5    0.0  0.0  0.4    0.0    0.0   3  30 c0t5000CCA01D5F8244d0
 8572.7    0.0 6830.0    0.0  0.0  0.7    0.0    0.1   3  35 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 18387.8    0.0 12203.5    0.0  0.1  0.7    0.0    0.0   7  57 c0t5000CCA01D5BF56Cd0

It is really good to see the new feature work so well in practice; it makes RAID-Z much more usable in many production environments. A complementary feature that also makes RAID-Z more practical is the RAID-Z/mirror hybrid allocator, introduced in Solaris 11 (pool version 29), which makes metadata access in RAID-Z much faster.

Both features are available only in Oracle Solaris 11 and not in OpenZFS derivatives, although OpenZFS has interesting new features of its own.
