Robert Milkowski's blog: 2013

Tuesday, November 12, 2013

ZFS Appliance

While building a storage appliance based on ZFS, one of the important features is the ability to identify physical disk locations, which is hard to do for SAS disks and easier for SATA. Solaris 11 has a topology framework which makes this much easier, and it is nicely integrated with various subsystems like FMA and ZFS. Recently I came across this blog entry which highlights this specific issue of how to identify physical disk locations.

The other important factor is ease of use - in order to replace a failed disk drive one should be able to pull out the bad one, put in a replacement, and that's it - all the rest should be done automatically. There shouldn't be any need to log in to the OS and issue commands to assist with the replacement. Again, this is how things are in Solaris 11.

Let's see how it works in practice. Recently one disk reported two read errors and multiple checksum errors during a zpool scrub. Because the affected pool is redundant (RAID-10), ZFS was able to detect the corruption, serve good data from the other disk in the mirror and fix the corrupted blocks on the affected disk.

The story doesn't end here though - FMA decided that too many checksum errors were reported for a single disk, so it activated a hot spare as a precaution, to proactively protect the pool in case the bad disk misbehaves again. This is what the pool looked like after the hot spare was fully attached:

# zpool status -v pool-0
  pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
        was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
  scan: resilvered 458G in 2h23m with 0 errors on Sat Nov  2 07:53:42 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        pool-0                       DEGRADED     0     0     0
          mirror-0                   ONLINE       0     0     0
            c0t5000CCA0165FC0F8d0    ONLINE       0     0     0
            c0t5000CCA016217040d0    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c0t5000CCA01666AB64d0    ONLINE       0     0     0
            c0t5000CCA0166F3BB8d0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c0t5000CCA0166F36C8d0    ONLINE       0     0     0
            c0t5000CCA01661894Cd0    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c0t5000CCA0166BE338d0    ONLINE       0     0     0
            c0t5000CCA016626340d0    ONLINE       0     0     0
          mirror-4                   ONLINE       0     0     0
            c0t5000CCA0166DC81Cd0    ONLINE       0     0     0
            c0t5000CCA016685238d0    ONLINE       0     0     0
          mirror-5                   ONLINE       0     0     0
            c0t5000CCA016636CA4d0    ONLINE       0     0     0
            c0t5000CCA016687528d0    ONLINE       0     0     0
          mirror-6                   ONLINE       0     0     0
            c0t5000CCA0166DC944d0    ONLINE       0     0     0
            c0t5000CCA0166DC0CCd0    ONLINE       0     0     0
          mirror-7                   ONLINE       0     0     0
            c0t5000CCA0166F4178d0    ONLINE       0     0     0
            c0t5000CCA01668DCC0d0    ONLINE       0     0     0
          mirror-8                   DEGRADED     0     0     0
            c0t5000CCA0166DD7BCd0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c0t5000CCA016671600d0  DEGRADED     2     0    49
              c0t5000CCA0166876F8d0  ONLINE       0     0     0
          mirror-9                   ONLINE       0     0     0
            c0t5000CCA0166DC20Cd0    ONLINE       0     0     0
            c0t5000CCA0166877BCd0    ONLINE       0     0     0
          mirror-10                  ONLINE       0     0     0
            c0t5000CCA0166F3334d0    ONLINE       0     0     0
            c0t5000CCA0166BDD2Cd0    ONLINE       0     0     0
        spares
          c0t5000CCA0166876F8d0      INUSE
          c0t5000CCA0166DCAACd0      AVAIL

device details:

        c0t5000CCA016671600d0      DEGRADED       too many errors
        status: FMA has degraded this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/DISK-8000-D5 for recovery


errors: No known data errors

Note that the mirror-8 vdev is a 3-way mirror now - as the affected disk is still functional it wasn't detached, but in case it goes really bad the hot spare has been attached alongside it, forming a 3-way mirror. Below is what FMA reported:

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 02 05:29:39 d10c88f7-8e31-ce12-ab5c-8a759cf875c3  DISK-8000-D5   Major

Problem Status    : solved
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle-Corporation
    Name          : SUN-FIRE-X4270-M3
    Part_Number   : 31792382+1+1
    Serial_Number : XXXXXXXX
    Host_ID       : 004858f6

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.io.scsi.disk.csum-zfs.transient
   Certainty   : 100%
   Affects     : dev:///:devid=id1,sd@n5000cca016671600//scsi_vhci/disk@g5000cca016671600
   Status      : faulted but still providing degraded service

   FRU
     Location         : "HDD17"
     Manufacturer     : HITACHI
     Name             : H109090SESUN900G
     Part_Number      : HITACHI-H109090SESUN900G
     Revision         : A31A
     Serial_Number    : XXXXXXXX
     Chassis
        Manufacturer  : Oracle-Corporation
        Name          : SUN-FIRE-X4270-M3
        Part_Number   : 31792382+1+1
        Serial_Number : XXXXXXXX
        Status        : faulty

Description : There have been excessive transient ZFS checksum errors on this
              disk.

Response    : A hot-spare disk may have been activated.

Impact      : If a hot spare is available it will be brought online and during
              this time I/O could be impacted. If a hot spare isn't available
              then I/O could be lost and data corruption is possible.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/DISK-8000-D5 for the latest service
              procedures and policies regarding this diagnosis.

Notice that FMA is reporting that the affected disk location is HDD17 - this corresponds to the HDD17 slot on the X3-2L server. That way we know exactly which disk to replace. We can also get physical disk locations from the zpool status command:

# zpool status -l pool-0
  pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
        was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
        Run 'zpool status -v' to see device specific details.
  scan: resilvered 458G in 2h23m with 0 errors on Sat Nov  2 07:53:42 2013
config:

        NAME                               STATE     READ WRITE CKSUM
        pool-0                             DEGRADED     0     0     0
          mirror-0                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD00/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD01/disk    ONLINE       0     0     0
          mirror-1                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD02/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD03/disk    ONLINE       0     0     0
          mirror-2                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD04/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD05/disk    ONLINE       0     0     0
          mirror-3                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD06/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD07/disk    ONLINE       0     0     0
          mirror-4                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD08/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD09/disk    ONLINE       0     0     0
          mirror-5                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD10/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD11/disk    ONLINE       0     0     0
          mirror-6                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD12/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD13/disk    ONLINE       0     0     0
          mirror-7                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD14/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD15/disk    ONLINE       0     0     0
          mirror-8                         DEGRADED     0     0     0
            /dev/chassis/SYS/HDD16/disk    ONLINE       0     0     0
            spare-1                        DEGRADED     0     0     0       
              /dev/chassis/SYS/HDD17/disk  DEGRADED     2     0    49
              /dev/chassis/SYS/HDD22/disk  ONLINE       0     0     0
          mirror-9                         ONLINE       0     0     0
            /dev/chassis/SYS/HDD18/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD19/disk    ONLINE       0     0     0
          mirror-10                        ONLINE       0     0     0
            /dev/chassis/SYS/HDD20/disk    ONLINE       0     0     0
            /dev/chassis/SYS/HDD21/disk    ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk      INUSE
          /dev/chassis/SYS/HDD23/disk      AVAIL

errors: No known data errors
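
On a pool with this many vdevs it is easy to overlook the one line with non-zero counters. A minimal sketch, assuming the five-column layout shown above (NAME STATE READ WRITE CKSUM); the function name zpool_problem_devices is made up for this example. It only looks at the error counters, so a device that is faulted with all counters at zero would not be listed.

```shell
# Read `zpool status` output on stdin and print every device line whose
# READ, WRITE or CKSUM counter is non-zero.
# Assumption: the config table has the five columns shown in the post.
zpool_problem_devices() {
    awk '
        # The config table starts right after the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The spares list and the trailing sections end the table.
        $1 == "spares" || /^errors:/ || /^device details:/ { in_config = 0 }
        in_config && NF == 5 && (($3 + $4 + $5) > 0) {
            printf "%s %s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}
```

It would be used as `zpool status -l pool-0 | zpool_problem_devices`; with the -l form the name it prints is the /dev/chassis path, which already tells you the bay.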

Now it is up to us: we can run a scrub, wait a few days and, if there are no new errors, clear the pool status and deactivate the hot spare; or, if we don't want to take any chances, we can replace the affected disk drive. We decided to replace it. The disk in bay 17 of the X3-2L was physically pulled out and a replacement was put in its place. Since we have the autoreplace property enabled on the pool, FMA/ZFS automatically put an EFI label on the new disk and attached it to the pool; once it had fully synchronized, the hot spare was detached and made available again. We didn't have to log in to the OS or coordinate in any way with the physical disk replacement. Here is what zpool history looked like:

# zpool history -i pool-0
…
2013-11-06.20:38:03 [internal pool scrub txg:1709498] func=2 mintxg=3 maxtxg=1709499 logs=0
2013-11-06.20:38:17 [internal vdev attach txg:1709501] replace vdev=/dev/dsk/c0t5000CCA016217834d0s0 \
                                                           for vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:33 [internal pool scrub done txg:1710852] complete=1 logs=0
2013-11-06.23:01:34 [internal vdev detach txg:1710854] vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:39 [internal vdev detach txg:1710855] vdev=/dev/dsk/c0t5000CCA0166876F8d0s0
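
The timestamps in the history can be used to cross-check how long the resilver took. A small sketch, assuming both timestamps fall on the same calendar day and that a POSIX awk is available (on Solaris that would be nawk or /usr/xpg4/bin/awk rather than the legacy awk); history_elapsed is a name invented for this example.

```shell
# Print the elapsed time between two `zpool history` timestamps of the
# form YYYY-MM-DD.HH:MM:SS.
# Assumption: both timestamps are on the same day (the date is ignored).
history_elapsed() {
    awk -v first="$1" -v last="$2" '
        function secs(ts,    p) {
            split(ts, p, /[.:]/)      # p[1]=date p[2]=HH p[3]=MM p[4]=SS
            return p[2] * 3600 + p[3] * 60 + p[4]
        }
        BEGIN {
            d = secs(last) - secs(first)
            printf "%dh%02dm%02ds\n", int(d / 3600), int(d % 3600 / 60), d % 60
        }
    '
}
```

Running `history_elapsed 2013-11-06.20:38:17 2013-11-06.23:01:33` (the vdev attach and the scrub done entries above) prints 2h23m16s.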

Let's see the pool status after the replacement disk fully synchronized:

# zpool status -l pool-0
  pool: pool-0
state: ONLINE
  scan: resilvered 459G in 2h23m with 0 errors on Wed Nov  6 23:01:34 2013
config:

        NAME                             STATE     READ WRITE CKSUM
        pool-0                           ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD00/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD01/disk  ONLINE       0     0     0
          mirror-1                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD02/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD03/disk  ONLINE       0     0     0
          mirror-2                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD04/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD05/disk  ONLINE       0     0     0
          mirror-3                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD06/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD07/disk  ONLINE       0     0     0
          mirror-4                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD08/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD09/disk  ONLINE       0     0     0
          mirror-5                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD10/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD11/disk  ONLINE       0     0     0
          mirror-6                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD12/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD13/disk  ONLINE       0     0     0
          mirror-7                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD14/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD15/disk  ONLINE       0     0     0
          mirror-8                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD16/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD17/disk  ONLINE       0     0     0
          mirror-9                       ONLINE       0     0     0
            /dev/chassis/SYS/HDD18/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD19/disk  ONLINE       0     0     0
          mirror-10                      ONLINE       0     0     0
            /dev/chassis/SYS/HDD20/disk  ONLINE       0     0     0
            /dev/chassis/SYS/HDD21/disk  ONLINE       0     0     0
        spares
          /dev/chassis/SYS/HDD22/disk    AVAIL
          /dev/chassis/SYS/HDD23/disk    AVAIL

errors: No known data errors

All is back to normal. We can also check if all disks, including the new one, are of the same part number, firmware level, etc.

# diskinfo -t disk -o Rcmenf1
R:receptacle-name  c:occupant-compdev     m:occupant-mfg  e:occupant-model  n:occupant-part           f:occupant-firm  1:occupant-capacity
-----------------  ---------------------  --------------  ----------------  ------------------------  ---------------  -------------------
SYS/HDD00          c0t5000CCA0165FC0F8d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD01          c0t5000CCA016217040d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD02          c0t5000CCA01666AB64d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD03          c0t5000CCA0166F3BB8d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD04          c0t5000CCA0166F36C8d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD05          c0t5000CCA01661894Cd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD06          c0t5000CCA0166BE338d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD07          c0t5000CCA016626340d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD08          c0t5000CCA0166DC81Cd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD09          c0t5000CCA016685238d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD10          c0t5000CCA016636CA4d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD11          c0t5000CCA016687528d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD12          c0t5000CCA0166DC944d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD13          c0t5000CCA0166DC0CCd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD14          c0t5000CCA0166F4178d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD15          c0t5000CCA01668DCC0d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD16          c0t5000CCA0166DD7BCd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD17          c0t5000CCA016217834d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD18          c0t5000CCA0166DC20Cd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD19          c0t5000CCA0166877BCd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD20          c0t5000CCA0166F3334d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD21          c0t5000CCA0166BDD2Cd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD22          c0t5000CCA0166876F8d0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
SYS/HDD23          c0t5000CCA0166DCAACd0  HITACHI         H109090SESUN900G  HITACHI-H109090SESUN900G  A31A             900185481216
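
With 24 rows it is easier to let a script confirm that everything matches than to compare them by eye. A minimal sketch, assuming the seven-column output produced by the -o Rcmenf1 format above; disk_variants is a hypothetical helper name.

```shell
# Read `diskinfo -t disk -o Rcmenf1` output on stdin and print each distinct
# combination of manufacturer, model, part number, firmware and capacity,
# prefixed with the number of disks that share it.
# Assumption: the seven whitespace-separated columns shown in the post.
disk_variants() {
    awk '
        $1 ~ /^R:/ || $1 ~ /^-+$/ { next }   # skip header and dashed rule
        NF == 7 { count[$3 " " $4 " " $5 " " $6 " " $7]++ }
        END { for (k in count) print count[k], k }
    '
}
```

A single line of output means all disks are identical; more than one line means at least one disk differs in model, firmware or size.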

This is a very cool integration of different features in Solaris 11 which makes ZFS-based solutions much more reliable and easier to support.

The topology framework should work out of the box on Oracle servers and Solaris 11. The above example is from Solaris 11.1 + SRU10 running on an X3-2L server with 24 disks in the front (and another 2 in the rear for the OS, mirrored by the controller itself). It has a simple HBA which presents all of the front disks as JBODs, which is perfect for ZFS.

The topology framework also works on 3rd party hardware, but depending on the particular set-up some additional configuration steps might be required, like defining Bay Labels. If the disks are behind a SAS expander then it is more complicated to get it working, and I'm not sure if there is a documented procedure describing how to do it.

Friday, November 01, 2013

SYNCHRONIZE_CACHE on close()

Recently, while testing iSCSI, I noticed that when you close a raw device which is an iSCSI target, Solaris sends a SCSI SYNCHRONIZE CACHE command on close().

$ dd if=/dev/zero of=/dev/rdsk/c0t6537643965643539d0s0 bs=1b count=1
1+0 records in
1+0 records out


$ dtrace -n fbt::*SYNCHRONIZE_CACHE:entry'{printf("%s %d\n", execname, pid);stack();}'
dtrace: description 'fbt::*SYNCHRONIZE_CACHE:entry' matched 1 probe

dd 2562

              sd`sdclose+0x1c0
              genunix`dev_close+0x55
              specfs`device_close+0xb3
              specfs`spec_close+0x171
              genunix`fop_close+0x9f
              genunix`closef+0x68
              genunix`closeandsetf+0x5be
              genunix`close+0x18
              unix`_sys_sysenter_post_swapgs+0x149

Thursday, October 24, 2013

ZFS Internals

In early December there will be a ZFS Internals course in London - there are still places available. If you are interested in learning DTrace, there will be a DTrace course a week earlier as well.

Friday, October 04, 2013

Solaris 11: IDRs

One of the lacking features in Solaris 11 was the way it dealt with IDRs when performing an OS update. Essentially one had to manually back out an IDR as a separate step before an update, or add --reject IDRxxx to the update (but this wouldn't necessarily be 100% correct). However, recently Oracle started publishing IDRs in their support repo once they are integrated, as an obsoleting package - long story short, it means that now one can just update the OS without worrying about how to back out an IDR if it has been fixed in a later release. For more details see here.

Sunday, September 15, 2013

The Importance of Hiring

Adam wrote about his experience at Delphix - the main point I agree with is the importance of hiring the right people. It takes time but it pays off in the long term.

Friday, August 30, 2013

Coming Back of Big Iron?

In the past decade servers have become boring - almost everything runs on cheap x86 servers which mainly differ by color. Now, 96 CPU sockets, 1,152 cores, 9,216 threads and up to 96TB of RAM in a server? I wouldn't mind playing with such a monster... read more

Friday, August 16, 2013

What's New in OpenZFS

Matt Ahrens talks about new features in Open (Illumos) ZFS. Some of the performance improvements Matt is talking about have had their equivalents in Solaris 11 for some time now though, and there are many more, for example:

6282155 arc doesn't always need to make a copy
15613053 : SUNBT6913905 IMPLEMENT ATA TRIM, SCSI WRITE SAME / UNMAP , THIN RECLAMATION
6281079 ZFS I/O priority inversion
6914162 Dedup of null blocks could use special treatment
6662450 L2ARC in memory overhead should be reduced
6957289 ARC metadata limit can have serious adverse effect on dedup performance
6896307 arc_meta_limit modernization
...

Solaris 11 ZFS also has encryption, a recordsize of up to 1MB, a RAID-Z/mirror hybrid allocator and support for 4K-sector disks, and there are more new features and improvements to ZFS. Then again, I like the LZ4 support in Illumos, which is not in Solaris 11... It is good to see, though, that both Illumos and Oracle are innovating around ZFS. From an end-user perspective it is a shame, in a way, that they do not actively share code. On the other hand, a little bit of competition might be good after all. We will see. PS. Notice that the video contains more updates on different technologies around Illumos, and they are worth watching as well.

Deduplication on ZFS and NetApp

Recently I came across a case where the deduplication ratio for the same data is lower on NetApp than on ZFS. This document may explain why - see the limits for dedup on NetApp starting on page 26. Apparently NetApp will silently stop deduplicating data once a specific limit, which varies between models and Ontap versions, is reached.

Does anyone have other ideas why the effectiveness of dedup on ZFS might be higher for the same data (assuming the same or a similar blocksize)?

Wednesday, August 14, 2013

ZFS/SLOG on SAN

What if you have ZFS deployed on SAN in a clustered environment and you require a dedicated SLOG? It would be really helpful if you could create a small LUN (2-4GB) directly out of the disk array's cache. This would be perfect for a SLOG. All reads and writes to such a LUN would be serviced entirely from the array's cache - meaning low latency and no double-writes to backend disks/SSDs for synchronous I/O. Some disk arrays actually do provide such a feature, for example see Hitachi's Cache Residency Manager.

Wednesday, June 26, 2013

Manta

Joyent announced Manta - Object Store with local compute. Pretty cool.
You can read more about it here and here and here. And if you want to see how many times Bryan can write f** then read this one as well.

But seriously, it looks really interesting.

Tuesday, May 28, 2013

NSS_OPTIONS

Sometimes developers put undocumented options in their code to help with debugging issues. This morning I came across one such option, which prints extra debug information when executing NSS queries.

First you need to disable nscd:

# svcadm disable -t name-service/cache

Then you need to set debug_eng_loop to a value greater than 0 in the NSS_OPTIONS environment variable, for example:

# NSS_OPTIONS="debug_eng_loop=2" ping -I 1 wp.pl
NSS_retry(0): 'ipnodes': trying 'files' ... result=NOTFOUND, action=CONTINUE
NSS: 'ipnodes': continue ...
NSS_retry(0): 'ipnodes': trying 'dns' ... result=SUCCESS, action=RETURN
NSS: 'ipnodes': return.
PING wp.pl: 56 data bytes

What's good about it is that it tells you which database configured in NSS returned the result.
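
If there are many lookups, the debug lines can be filtered down to just the source that answered. A minimal sketch, assuming the NSS_retry line format shown above; nss_winner is a made-up name, and whether the debug lines arrive on stdout or stderr is not something the example above shows, so the redirection may need adjusting.

```shell
# Read NSS debug output on stdin and print "database: source" for every
# lookup that ended with result=SUCCESS.
# Assumption: lines look like
#   NSS_retry(0): 'ipnodes': trying 'dns' ... result=SUCCESS, action=RETURN
nss_winner() {
    # Splitting on single quotes puts the database in $2 and the source in $4.
    awk -F"'" '/^NSS_retry/ && /result=SUCCESS/ { print $2 ": " $4 }'
}
```

For the ping example above this would print "ipnodes: dns".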

Tuesday, May 21, 2013

Setting RPATH

Today I was made aware that the elfedit tool in Solaris 11 allows for setting RPATH (among other things). The only caveat is that the binary has to have been linked on Solaris 11. It is very easy to use:

 # elfedit -e 'dyn:runpath $ORIGIN/../lib' /opt/bin/myprog

There is a nice blog entry about it from Ali Bahrami.

Saturday, May 18, 2013

Btrfs: "are we there yet?"

Not in the foreseeable future...

Friday, March 22, 2013

OpenAFS on Solaris 11 x86

Two days ago I presented at the UK Solaris SIG meeting on running OpenAFS on Solaris 11 x86. This is essentially the same talk I gave last year in Edinburgh; I just added a few slides explaining what OpenAFS is.

Tuesday, March 19, 2013

-ck hacking: Other schedulers? Illumos?

Friday, March 08, 2013

ZFS Write Performance

Interesting blog entry from guys at Delphix on ZFS write performance - impact of fragmentation.

Tuesday, March 05, 2013

ZFS: no-op overwrites

There is an interesting new feature in ZFS in Illumos.

https://www.illumos.org/issues/3236

When overwriting a block which is checksummed with a cryptographically secure hash function we can compare the old and new checksums for the block to determine if they differ (at almost no cost since we were going to do the checksums anyway). If they do not differ we don't actually need to do the write. This:
1) Reduces I/O
2) Reduces space usage, because if the old block is referenced by a snapshot we will need to keep both copies of the block around even though they contain the same data.
This functionality is only enabled if:
1) The old and new blocks are checksummed using the same algorithm.
2) That algorithm is cryptographically secure (e.g. sha256)
3) Compression is enabled on that block.
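
The core idea (skip the write when the old and new checksums match) can be illustrated at file level in user space. This is only an analogy under stated assumptions, not how ZFS implements it: ZFS works per block inside the kernel and already has the old checksum stored in the block pointer, whereas this sketch has to re-read the old file to hash it, and it ignores the compression condition. The function names are invented, and it assumes either sha256sum or openssl is installed.

```shell
# Print the SHA-256 of file $1 as a bare hex string.
# Assumption: sha256sum or openssl is available on the system.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum < "$1" | awk '{ print $1 }'
    else
        openssl dgst -sha256 < "$1" | awk '{ print $NF }'
    fi
}

# Write stdin to file $1 unless the file already holds data with the same
# SHA-256; prints "skipped" or "written" to show which path was taken.
write_if_changed() {
    target=$1
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    if [ -f "$target" ] && [ "$(sha256_of "$target")" = "$(sha256_of "$tmp")" ]; then
        rm -f "$tmp"              # same checksum: trust it, skip the write
        echo skipped
    else
        mv "$tmp" "$target"       # different (or no) old data: do the write
        echo written
    fi
}
```

Like the ZFS feature, this trusts the hash alone: no byte-for-byte comparison is made before skipping the write.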

Philosophical question - should we just trust sha256?

(It seems this can't be disabled, nor is there an option similar to verify=on in dedup.)

There are more interesting new zfs features in Illumos (for example this one which does a similar thing to what Solaris 11 does). The only regret is that unless one wants to play with one of the appliances based on Illumos the only way to use these features is to use FreeBSD or Linux, which is rather ironic. But on the other hand - why not? At least at home.

Friday, January 04, 2013

Whose bug is this anyway?!?

Interesting post on bug fixing.