Saturday, December 13, 2014

ZFS: RAID-Z Resilvering

Solaris 11.2 introduced a new ZFS pool version: 35 (Sequential Resilver).

The new feature is supposed to make disk resilvering (disk replacement, hot-spare synchronization, etc.) much faster. It achieves this by reading the relevant metadata ahead of time and then reading the data to be resilvered in a largely sequential manner. And it does work!
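To get this behaviour an existing pool has to run at least version 35, so older pools need a one-way upgrade first - a quick sketch (the pool name tank is just an example):

# zpool get version tank
# zpool upgrade -v       # lists the pool versions this release supports
# zpool upgrade tank     # one-way upgrade to the newest supported version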

Here is a real world case, with real data - over 150 million files of different sizes, most of them relatively small. Many of them have been deleted and new ones written over time, so I expect the data in the pool is already fragmented. The server is a Sun/Oracle X4-2L with 26x 1.2TB 2.5" 10k SAS disks. The 24 disks in front are presented in pass-through mode and managed by ZFS, configured as 3 RAID-Z pools; the other 2 disks in the rear are configured as RAID-1 in the RAID controller and used for the OS. A disk in one of the pools failed and a hot spare automatically attached:

# zpool status -x
  pool: XXXXXXXXXXXXXXXXXXXXXXX-0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri Dec 12 21:02:58 2014
    3.60T scanned
    45.9G resilvered at 342M/s, 9.96% done, 2h45m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXXXXX-0    DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            spare-0                  DEGRADED     0     0     0
              c0t5000CCA01D5EAE50d0  UNAVAIL      0     0     0
              c0t5000CCA01D5EED34d0  DEGRADED     0     0     0  (resilvering)
            c0t5000CCA01D5BF56Cd0    ONLINE       0     0     0
            c0t5000CCA01D5E91B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F9B00d0    ONLINE       0     0     0
            c0t5000CCA01D5E87E4d0    ONLINE       0     0     0
            c0t5000CCA01D5E95B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F8244d0    ONLINE       0     0     0
            c0t5000CCA01D58B3A4d0    ONLINE       0     0     0
        spares
          c0t5000CCA01D5EED34d0      INUSE
          c0t5000CCA01D5E1E3Cd0      AVAIL

errors: No known data errors

Let's look at the I/O statistics for the disks involved:

# iostat -xnC 1 | egrep "device| c0$|c0t5000CCA01D5EAE50d0|c0t5000CCA01D5EED34d0..."
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 16651.6  503.9 478461.6 69423.4  0.2 26.3    0.0    1.5   1  19 c0
 2608.5    0.0 70280.3    0.0  0.0  1.6    0.0    0.6   3  36 c0t5000CCA01D5E95B0d0
 2582.5    0.0 66708.5    0.0  0.0  1.9    0.0    0.7   3  39 c0t5000CCA01D5F9B00d0
 2272.6    0.0 68571.0    0.0  0.0  2.9    0.0    1.3   2  50 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  503.9    0.0 69423.8  0.0  9.7    0.0   19.3   2 100 c0t5000CCA01D5EED34d0
 2503.5    0.0 66508.4    0.0  0.0  2.0    0.0    0.8   3  41 c0t5000CCA01D58B3A4d0
 2324.5    0.0 67093.8    0.0  0.0  2.1    0.0    0.9   3  44 c0t5000CCA01D5F8244d0
 2285.5    0.0 69192.3    0.0  0.0  2.3    0.0    1.0   2  45 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 1997.6    0.0 70006.0    0.0  0.0  3.3    0.0    1.6   2  54 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 25150.8  624.9 499295.4 73559.8  0.2 33.7    0.0    1.3   1  22 c0
 3436.4    0.0 68455.3    0.0  0.0  3.3    0.0    0.9   2  51 c0t5000CCA01D5E95B0d0
 3477.4    0.0 71893.7    0.0  0.0  3.0    0.0    0.9   3  48 c0t5000CCA01D5F9B00d0
 3784.4    0.0 72370.6    0.0  0.0  3.6    0.0    0.9   3  56 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  624.9    0.0 73559.8  0.0  9.4    0.0   15.1   2 100 c0t5000CCA01D5EED34d0
 3170.5    0.0 72167.9    0.0  0.0  3.5    0.0    1.1   2  55 c0t5000CCA01D58B3A4d0
 3881.4    0.0 72870.8    0.0  0.0  3.3    0.0    0.8   3  55 c0t5000CCA01D5F8244d0
 4252.3    0.0 70709.1    0.0  0.0  3.2    0.0    0.8   3  53 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 3063.5    0.0 70380.1    0.0  0.0  4.0    0.0    1.3   2  60 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 17190.2  523.6 502346.2 56121.6  0.2 31.0    0.0    1.8   1  18 c0
 2342.7    0.0 71913.8    0.0  0.0  2.9    0.0    1.2   3  43 c0t5000CCA01D5E95B0d0
 2306.7    0.0 72312.9    0.0  0.0  3.0    0.0    1.3   3  43 c0t5000CCA01D5F9B00d0
 2642.1    0.0 68822.9    0.0  0.0  2.9    0.0    1.1   3  45 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  523.6    0.0 56121.2  0.0  9.3    0.0   17.8   1 100 c0t5000CCA01D5EED34d0
 2257.7    0.0 71946.9    0.0  0.0  3.2    0.0    1.4   2  44 c0t5000CCA01D58B3A4d0
 2668.2    0.0 72685.4    0.0  0.0  2.9    0.0    1.1   3  43 c0t5000CCA01D5F8244d0
 2236.6    0.0 71829.5    0.0  0.0  3.3    0.0    1.5   3  47 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 2695.2    0.0 72395.4    0.0  0.0  3.2    0.0    1.2   3  45 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 31265.3  578.9 342935.3 53825.1  0.2 18.3    0.0    0.6   1  15 c0
 3748.0    0.0 48255.8    0.0  0.0  1.5    0.0    0.4   2  42 c0t5000CCA01D5E95B0d0
 4367.0    0.0 47278.2    0.0  0.0  1.1    0.0    0.3   2  35 c0t5000CCA01D5F9B00d0
 4706.1    0.0 50982.6    0.0  0.0  1.3    0.0    0.3   3  37 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  578.9    0.0 53824.8  0.0  9.7    0.0   16.8   1 100 c0t5000CCA01D5EED34d0
 4094.1    0.0 48077.3    0.0  0.0  1.2    0.0    0.3   2  35 c0t5000CCA01D58B3A4d0
 5030.1    0.0 47700.1    0.0  0.0  0.9    0.0    0.2   3  33 c0t5000CCA01D5F8244d0
 4939.9    0.0 52671.2    0.0  0.0  1.1    0.0    0.2   3  33 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 4380.1    0.0 47969.9    0.0  0.0  1.4    0.0    0.3   3  36 c0t5000CCA01D5BF56Cd0
^C

These are pretty amazing numbers for RAID-Z - the only reason a single disk drive can deliver so many thousands of reads per second is that most of them must be almost perfectly sequential (note the average read size is only in the tens of KB, so the throughput comes from lots of small, mostly sequential reads). From time to time I see even more amazing numbers:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 73503.1 3874.0 53807.0 19166.6  0.3  9.8    0.0    0.1   1  16 c0
 9534.8    0.0 6859.5    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5E95B0d0
 9475.7    0.0 6969.1    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5F9B00d0
 9646.9    0.0 7176.4    0.0  0.0  0.4    0.0    0.0   3  31 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0 3478.6    0.0 18040.0  0.0  5.1    0.0    1.5   2  98 c0t5000CCA01D5EED34d0
 8213.4    0.0 6908.0    0.0  0.0  0.8    0.0    0.1   3  38 c0t5000CCA01D58B3A4d0
 9671.9    0.0 6860.5    0.0  0.0  0.4    0.0    0.0   3  30 c0t5000CCA01D5F8244d0
 8572.7    0.0 6830.0    0.0  0.0  0.7    0.0    0.1   3  35 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 18387.8    0.0 12203.5    0.0  0.1  0.7    0.0    0.0   7  57 c0t5000CCA01D5BF56Cd0

It is really good to see the new feature work so well in practice - it is what makes RAID-Z much more usable in many production environments. The other feature which complements this one, and also makes RAID-Z much more practical to use, is the RAID-Z/mirror hybrid allocator introduced in Solaris 11 (pool version 29), which makes accessing metadata in RAID-Z much faster.

Both features are only available in Oracle Solaris 11 and not in OpenZFS derivatives, although OpenZFS has its own interesting new features as well.

Friday, December 05, 2014

ZFS Performance Improvements

One of the blogs I value is Roch Bourbonnais's. He hadn't blogged much for some time, but he is back to blogging! He listed the main ZFS performance improvements since Oracle took over, and then also provided more details on ReARC. There is more to come from him, hopefully soon.

PS: to be clear, he is describing improvements to Oracle's ZFS as found in Solaris 11 and the ZFS Storage Appliance, not OpenZFS.

Friday, November 14, 2014

Kernel Zones Internals

There is a very good presentation describing how Kernel Zones are implemented from an architectural point of view.

Kernel Zones: root disk recovery

I've been playing with Solaris Kernel Zones recently. I installed a zone named kz1 on an iSCSI device so I could test Live Migration as well. At some point I wanted to modify the contents of the kz1 root disk from outside the zone, so with kz1 shut down I imported its pool in the global zone (zpool import -t ...). This worked fine. However, the zone then crashed on boot because it couldn't import its pool. The reason is that when a pool is imported on another system the phys_path recorded in its ZFS labels is updated, and when a kernel zone boots it tries to import its root pool based on that phys_path, which may now be invalid - exactly my case. The root disk of the kz1 zone was configured as:

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set bootpri=0
set id=0
end

This results in phys_path /zvnex/zvblk@0:b, as the disk driver is virtualized in Kernel Zones. After the pool was imported in the global zone the phys_path was updated to /scsi_vhci/disk@g600144f0a613c900000054521d550001:b, which won't work in the kernel zone. One way to work around the issue is to create another "rescue" zone with its root disk's id set to 1, and then add the root disk of kz1 to it as an additional non-bootable disk with id=0:
 
add device
set storage=dev:zvol/dsk/rpool/kz2
set bootpri=0
set id=1
end

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set id=0
end
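These device stanzas are zonecfg resources; a sketch of how they might be applied, assuming a rescue zone named kzrescue has already been created from the kernel zone template and the stanzas above are saved in a command file:

# zonecfg -z kzrescue -f rescue-devices.cfg
# zoneadm -z kzrescue install
# zoneadm -z kzrescue boot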

Now in order to update the phys_path to the correct one, the pool needs to be imported and exported in the rescue zone:
 
# zpool import -R /mnt/1 -Nf -t kz1 rpool
# zpool export kz1
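To double-check which phys_path is now recorded, the vdev label can be inspected with zdb (the device path below is purely illustrative):

# zdb -l /dev/rdsk/c1d0s0 | grep phys_path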

Notice that the rescue zone doesn't even need to have a network configured - in fact all you need is a minimal OS installation with almost everything disabled, and you log in to it via zlogin.

The kz1 zone will now boot just fine. In most cases you don't need the above procedure - you should be able to do all the customizations via AI manifests, system profiles and SMF. But in some cases it is useful to be able to manipulate the root disk contents of a kernel zone without actually booting it, or perhaps to recover its rpool after it has become unbootable.

Monday, November 03, 2014

Rollback ZFS volume exported over iSCSI

While playing with Kernel Zones on Solaris 11.2 I noticed that once a ZFS volume is shared over iSCSI, while I can still create a snapshot of it, I can't roll it back - I get a "volume is busy" error. I found a way to do it:

# stmfadm delete-lu 600144F0A613C900000054521D550001
# zfs rollback pool/idisk0@snap1
# stmfadm import-lu /dev/zvol/rdsk/pool/idisk0
Logical unit imported: 600144F0A613C900000054521D550001
# stmfadm add-view 600144F0A613C900000054521D550001

Although this should be easier...
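Since import-lu came back with the same GUID as before, the whole sequence can be wrapped in a small helper - just a sketch, with the GUID, volume and snapshot names taken from the example above:

#!/bin/ksh
# roll back an iSCSI-shared zvol; values below come from the example above
GUID=600144F0A613C900000054521D550001
VOL=pool/idisk0
SNAP=snap1

stmfadm delete-lu $GUID
zfs rollback $VOL@$SNAP
stmfadm import-lu /dev/zvol/rdsk/$VOL
stmfadm add-view $GUID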

The iSCSI LUN contains a kernel zone image taken just after installation. However, now I couldn't boot the zone:

# zoneadm -z kz1 attach
zone 'kz1': error: Encryption key not available. See solaris-kz(5) for configuration migration
zone 'kz1': procedure or restore /etc/zones/keys/kz1.

Right, the man page explains it all - there is host metadata needed to boot a kernel zone, and it is encrypted. Since I rolled back the ZFS volume to a previous installation, the encryption key stored in the zone's configuration was no longer valid. I had to re-create it:

# zoneadm -z kz1 attach -x initialize-hostdata
# zoneadm -z kz1 boot

And now it booted just fine.

Tuesday, October 28, 2014

Oracle Open World 2014

This was my first time attending OOW and I must say I did like it. What a big marketing extravaganza...! It wasn't all marketing though, there were many interesting technical sessions too, but most importantly it was a big opportunity to meet with Oracle engineers and management, both the people I met in the past and some new faces I only exchanged emails with. It was also interesting to talk to other customers and see what they've been up to, and some of them are doing very interesting things. On Sunday there was also a Solaris Customer Advisory Board meeting which was very interesting.

One of the things that surprised me was how many other vendors were present there, having their stands - virtually everyone, from startups to large tier one vendors. I guess it is a good opportunity for everyone to meet with their customers (and potentially new customers).

I also presented there on how we use Solaris and took part in a Solaris Customer Panel - both were good experiences.

For more details see Markus Flierl's post.

Friday, August 29, 2014

How ASLR exposed an old bug

Recently I was investigating an issue where OpenAFS server processes were crashing on start-up when ASLR (Address Space Layout Randomization) was enabled. All of them were crashing in the same place. Initially I enabled ASLR globally and restarted the AFS services:
$ sxadm enable -c model=all aslr
$ svcadm restart ms/afs/server
This resulted in core files from all the daemons; let's look at one of them:
$ mdb core.fileserver.1407857175
Loading modules: [ libc.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of fileserver (64-bit) from XXXX
file: /usr/afs/bin/fileserver
initial argv: /usr/afs/bin/fileserver -p 123 -pctspare 20 -L -busyat 50 -rxpck 2000 -rxbind
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffffb7a94b20
> ::stack
libc.so.1`memset+0x31c()
afsconf_OpenInternal+0x965()
afsconf_Open+0x125()
main+0x118()
0x44611c()
>
All the other daemons crashed in the same place. Let's take a closer look at the core.
> afsconf_OpenInternal+0x965::dis
afsconf_OpenInternal+0x930: movl $0x0,0xfffffffffffffce4(%rbp)
afsconf_OpenInternal+0x93a: movq 0xfffffffffffffce0(%rbp),%r9
afsconf_OpenInternal+0x941: movq -0x8(%rbp),%r8
afsconf_OpenInternal+0x945: movq %r9,0x18(%r8)
afsconf_OpenInternal+0x949: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x94d: movl $0x0,%eax
afsconf_OpenInternal+0x952: call +0x1619 
afsconf_OpenInternal+0x957: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x95b: movl $0x0,%eax
afsconf_OpenInternal+0x960: call +0x8ecb <_afsconf_LoadRealms>
afsconf_OpenInternal+0x965: movl $0x0,-0x24(%rbp)
afsconf_OpenInternal+0x96c: jmp +0x6 
afsconf_OpenInternal+0x96e: nop
afsconf_OpenInternal+0x970: jmp +0x2 
afsconf_OpenInternal+0x972: nop
afsconf_OpenInternal+0x974: movl -0x24(%rbp),%eax
afsconf_OpenInternal+0x977: leave
afsconf_OpenInternal+0x978: ret
0x4e4349: nop
0x4e434c: nop
ParseHostLine: pushq %rbp
>
It actually crashes in _afsconf_LoadRealms(); we need a little bit more debug info:
$ truss -u a.out -u :: -vall /usr/afs/bin/fileserver $args
...
/1: -> _afsconf_LoadRealms(0x831790290, 0x1, 0x3, 0x0, 0x5bbbe8, 0x8317952bc)
/1: -> libc:malloc(0x28, 0x1, 0x3, 0x0, 0x28, 0x8317952bc)
...
/1: <- libc:malloc() = 0x8317965c0
/1: -> libc:memset(0x317965c0, 0x0, 0x28, 0x0, 0x28, 0x8317952bc)
/1: Incurred fault #6, FLTBOUNDS %pc = 0x7FFD55C802CC
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
/1: Received signal #11, SIGSEGV [default]
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
It fails just after the first malloc() followed by memset() in _afsconf_LoadRealms(); the corresponding source code is:
local_realms = malloc(sizeof(struct afsconf_realms));
if (!local_realms) {
    code = ENOMEM;
    goto cleanup;
}
memset(local_realms, 0, sizeof(struct afsconf_realms));
queue_Init(&local_realms->list);
The code looks fine... but notice in the truss output above that memset() is called with a different pointer from the one malloc() returned. It might be a bug in truss, but since this is exactly where it crashes it is probably real. Let's confirm it with another tool and maybe we can also spot a pattern.
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, c62324a50
3 99433 memset:entry 62324a50 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, 10288d120
3 99433 memset:entry 288d120 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, de9479a10
3 99433 memset:entry ffffffffe9479a10 0 40
It looks like the lowest 4 bytes of the pointer returned from malloc() and passed to memset() are always preserved, while the top 4 bytes are mangled. I was curious what it looks like with ASLR disabled:
$ elfedit -e 'dyn:sunw_aslr disable' /usr/afs/bin/fileserver
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::_afsconf_LoadRealms:return{self->in=0}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

1 99436 malloc:entry 40
1 99438 malloc:return 54, 5bd170
1 99434 memset:entry 5bd170 0 40
... [ it continues as it doesn't crash ]
Now the pointer passed to memset() is the same as the one returned from malloc() - notice however that it fits in 32 bits (all the daemons are compiled as 64-bit). Let's have a look at the core again, at the place where it actually fails:
_afsconf_LoadRealms+0x59: call -0xa7c36 
_afsconf_LoadRealms+0x5e: movl %eax,%eax
_afsconf_LoadRealms+0x60: cltq
_afsconf_LoadRealms+0x62: movq %rax,%r8
_afsconf_LoadRealms+0x65: movq %r8,-0x20(%rbp)
_afsconf_LoadRealms+0x69: movq -0x20(%rbp),%r8
_afsconf_LoadRealms+0x6d: cmpq $0x0,%r8
_afsconf_LoadRealms+0x71: jne +0xd <_afsconf_LoadRealms+0x80>
_afsconf_LoadRealms+0x73: movl $0xc,-0x18(%rbp)
_afsconf_LoadRealms+0x7a: jmp +0x1f5 <_afsconf_LoadRealms+0x274>
_afsconf_LoadRealms+0x7f: nop
_afsconf_LoadRealms+0x80: movl $0x28,-0x48(%rbp)
_afsconf_LoadRealms+0x87: movl $0x0,-0x44(%rbp)
_afsconf_LoadRealms+0x8e: movq -0x48(%rbp),%r8
_afsconf_LoadRealms+0x92: movq %r8,%rdx
_afsconf_LoadRealms+0x95: movl $0x0,%esi
_afsconf_LoadRealms+0x9a: movq -0x20(%rbp),%rdi
_afsconf_LoadRealms+0x9e: movl $0x0,%eax
_afsconf_LoadRealms+0xa3: call -0xa7d10 

Bingo! See the movl and cltq instructions right after the return from malloc(). This means malloc() returns a 64-bit address but the compiler thinks it returns a 32-bit value, so it clears the top 4 bytes and then sign-extends the result back to 64 bits, and that is what gets passed to memset(). With ASLR disabled we just happen to get an address low enough that the lowest 4 bytes are sufficient to address it, so the issue doesn't show up; with ASLR we usually end up with a much higher address where you can't just chop off the top four bytes.

Compilers do this when they see an implicit function declaration - they then assume the function returns an int, which on x86_64 means 32 bits. The fix was trivial: all that was required was to add #include <stdlib.h> and recompile. Now the compiler knows that malloc() returns a 64-bit pointer, the movl/cltq instructions are gone, and there are no more crashes.
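For illustration, here is a minimal contrived sketch of the same class of bug (this is not the OpenAFS source): with no declaration of malloc() in scope, older compilers merely warn about the implicit declaration and quietly truncate the returned pointer, and a 64-bit build with ASLR enabled makes the truncation visible very quickly.

/* implicit-malloc.c - contrived example of the same bug class (not OpenAFS code) */
#include <string.h>                     /* declares memset(), but NOT malloc() */

struct blob { char data[0x28]; };

struct blob *
make_blob(void)
{
        /* With no prototype in scope an old C compiler assumes malloc() returns
         * int (32 bits on x86_64); the cast silences the warning and the
         * returned pointer gets truncated and sign-extended - the movl/cltq
         * pattern seen in the disassembly above. */
        struct blob *p = (struct blob *)malloc(sizeof (struct blob));

        if (p != NULL)
                memset(p, 0, sizeof (struct blob));
        return (p);
}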

Friday, August 08, 2014

Monitoring Dropped UDP packets

I needed to monitor incoming UDP packets being dropped by the kernel and report some basic information on each of them (source and destination IP:port is enough). There are multiple probes one could use; in particular ip:::drop-in is the best candidate, however it is not fully implemented, so it doesn't provide all the arguments the other probes in the IP provider have (this seems easy to fix). We can intercept the ip_drop_input() function instead (which is where the ip:::drop-in probe is located anyway) and juggle pointers to get what we want - once the drop-in probe is fixed, the script below will become shorter and easier to understand.
# ./udp_dropped_packets.d
2014 Aug  6 14:32:14 dropped UDP packet [udpIfStatsInOverflows] from 10.10.1.4:52537 to 10.10.2.9:7001
2014 Aug  6 14:32:15 dropped UDP packet [udpIfStatsInOverflows] from 10.10.1.4:52537 to 10.10.2.9:7001
2014 Aug  6 14:32:17 dropped UDP packet [l2 multicast not forwarded] from 10.10.3.11:137 to 10.10.3.9:137

Notice that the script prints the reason why a packet was dropped; if you are only interested in drops due to lack of buffer space, grep the output for udpIfStatsInOverflows.
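For example (assuming the script sits in the current directory):

# ./udp_dropped_packets.d | grep udpIfStatsInOverflows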

The script has only been tested on Solaris 11.1 and 11.2, but it probably works on 11.0 and Illumos-based distributions as well.

# cat udp_dropped_packets.d
#!/usr/sbin/dtrace -qa

/* Robert Milkowski */

fbt::ip_drop_input:entry
/ ((ipha_t *)args[1]->b_rptr)->ipha_protocol == IPPROTO_UDP /
{
  /* the IP header sits at the mblk's read pointer */
  this->ipha = (ipha_t *)args[1]->b_rptr;
  this->ip_src = inet_ntoa(&this->ipha->ipha_src);
  this->ip_dst = inet_ntoa(&this->ipha->ipha_dst);

  /* the UDP header follows the (variable length) IP header */
  this->iph_hdr_length = (int)(this->ipha->ipha_version_and_hdr_length & 0xF) << 2;
  this->udpha = (udpha_t *)((char *)this->ipha + this->iph_hdr_length);
  this->src_port = ntohs(this->udpha->uha_src_port);
  this->dst_port = ntohs(this->udpha->uha_dst_port);

  /* arg0 is the drop reason string passed to ip_drop_input() */
  printf("%Y dropped UDP packet [%s] from %s:%d to %s:%d\n", walltimestamp,
          stringof(arg0),
          this->ip_src, this->src_port,
          this->ip_dst, this->dst_port
        );
}

Tuesday, July 22, 2014

Massive msync() speed up on ZFS

MongoDB uses mmap() to access all of its files. It also has a special thread which by default wakes up every 60s and calls msync() for all the mmap'ed files, one at a time. Initially, when you start MongoDB, all these msync()s are fast, assuming there are no modifications. However, if your server has hundreds of GBs of RAM (or more) and the database is also that big, the msync()s get slower over time - they take longer the more data is cached in RAM. Eventually it can take 50s or more for the thread to finish syncing all of the files, even if there is nothing to write to disk. The actual problem is that while the thread is syncing all the files it holds a global lock, and until it finishes the database is almost useless. If it takes 50s to sync all the files, the database can process requests for only 10s out of each 60s window...
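The pattern the fix targets is roughly the following - a minimal sketch of msync() over an entire mapping (the function name and error handling are my own simplifications):

/* flush an entire mmap'ed file - the MongoDB DataFileSync-style pattern */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
flush_file(const char *path)
{
        struct stat st;
        void *addr;
        int fd, rc;

        if ((fd = open(path, O_RDWR)) == -1)
                return (-1);
        if (fstat(fd, &st) == -1) {
                (void) close(fd);
                return (-1);
        }
        addr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                (void) close(fd);
                return (-1);
        }

        /* a single msync() covering the whole mapping - this is the case the
         * Solaris fix speeds up when there is little or nothing dirty to write */
        rc = msync(addr, st.st_size, MS_SYNC);

        (void) munmap(addr, st.st_size);
        (void) close(fd);
        return (rc);
}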

If you have logging enabled in MongoDB you should see log entries like:

    Tue Apr 29 06:22:02.901 [DataFileSync] flushing mmaps took 33302ms for 376 files

On Linux this is much faster, as it has a special optimization for this case which Solaris didn't have.
However, Oracle fixed this some time ago and now the same database reports:

    Tue Apr 29 12:55:51.009 [DataFileSync] flushing mmaps took 9ms for 378 files

That is an over 3000x improvement!
 
The Solaris bug ID is 18658199 "Speed up msync() on ZFS by 90000x with this one weird trick",
which is fixed in Solaris 11.1 SRU21 and in Solaris 11.2.

Note that the fix only improves the case where an entire file is msync'ed and the underlying file system is ZFS. Any application with similar behavior would benefit. For some large mappings, like 1TB, the improvement can be as much as 178,000x.


Wednesday, April 09, 2014

Slow process core dump

Solaris limits the rate at which a process core dump is generated. There are two tunables, with default values core_delay_usec=10000 and core_chunk=32 (in pages). This essentially means that writing a core dump is limited to 100 writes of 128KB per second (32x 4KB pages on x86), which is the equivalent of 12.5MB/s - for large processes this will take quite a long time.

I set core_chunk=128 and core_delay_usec=0 via mdb, which increased the core dump write rate from about 12.5MB/s to about 300-600MB/s.
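For reference, a sketch of changing the live kernel values with mdb (they revert on reboot; to persist they would need corresponding set entries in /etc/system):

# echo "core_chunk/W 0t128" | mdb -kw
# echo "core_delay_usec/W 0t0" | mdb -kw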

Long core dump generation times can cause issues like a delay in restarting a service under SMF. On the other hand, making it too fast might overwhelm the OS and impact other processes running there. Still, I think the defaults are probably too conservative.

Update: I just noticed that this has been fixed in Illumos, see https://illumos.org/issues/3673
 

Thursday, April 03, 2014

Home NAS - Data Corruption

I value my personal pictures, so I store them on a home NAS running Solaris/ZFS. The server has two mirrored data disks. It had been running perfectly fine for the last two years, but recently one of the disk drives returned corrupted data during a ZFS scrub. If it weren't for ZFS I probably wouldn't even know that some pixels were wrong... or perhaps that some pictures were missing... This is exactly why I use ZFS on my home NAS.

# zpool status vault
  pool: vault
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 256K in 3h54m with 0 errors on Tue Mar 11 17:34:22 2014
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c5t0d0  ONLINE       0     0     0

errors: No known data errors
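The corruption was only caught because a scrub ran; a sketch of a root crontab entry to scrub the pool regularly (the schedule is just an example):

# scrub the 'vault' pool every Sunday at 03:00
0 3 * * 0 /usr/sbin/zpool scrub vault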