Friday, November 14, 2014

Kernel Zones Internals

There is a very good presentation describing how Kernel Zones are implemented from an architecture point of view.

Kernel Zones: root disk recovery

I've been playing with Solaris Kernel Zones recently. I installed a zone named kz1 on an iscsi device so I could test Live Migration as well. At some point I wanted to modify contents of the kz1 root disk outside of the zone, so I imported the pool in a global zone, of course while kz1 was shut down. This worked fine (zpool import -t ...). However the zone crashed on boot as it couldn't import the pool. The reason is that when a pool is imported on another system it updates phys_path in its ZFS label and when kernel zone boots it tries to import its root pool based on phys_path which now might be invalid, which was exactly the case for me. The root disk of kz1 zone was configured as: 

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set bootpri=0
set id=0
end

This results in phys_path: /zvnex/zvblk@0:b as a disk driver is virtualized in Kernel Zones. After the pool was imported in a global zone the phys_path was updated to: /scsi_vhci/disk@g600144f0a613c900000054521d550001:b which won't work in the kernel zone. One way to workaround the issue is to create another "rescue" zone with its root disk's id set to 1. Then add the root disk of kz1 to it as an additional non-bootable disk with id=0.
 
add device
set storage=dev:zvol/dsk/rpool/kz2
set bootpri=0
set id=1
end

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set id=0
end

Now in order to update the phys_path to the correct one, the pool needs to be imported and exported in the rescue zone:
 
# zpool import -R /mnt/1 -Nf -t kz1 rpool
# zpool export kz1

Notice that the rescue zone doesn't even need to have a network configured - in fact all you need is a minimal OS installation in it with almost everything disabled and you login to it via zlogin.

The kz1 zone will now boot just fine. In most cases you don't need to use the above procedure.You should be able to do all the customizations via AI manifests and system profiles and SMF. But in some cases it is useful to be able to manipulate root disk contents of a kernel zone without actually booting it. Or perhaps you need to recover rpool after it became unbootable.

Monday, November 03, 2014

Rollback ZFS volume exported over iSCSI

While playing with Kernel Zones on Solaris 11.2 I noticed that once a ZFS volume is shared over iSCSI, while I can create a snapshot of it, I can't roll it back - I get "volume is busy" error message. I found a way to do it:

# stmfadm delete-lu 600144F0A613C900000054521D550001
# zfs rollback pool/idisk0@snap1
# stmfadm import-lu /dev/zvol/rdsk/pool/idisk0
Logical unit imported: 600144F0A613C900000054521D550001
# stmfadm add-view 600144F0A613C900000054521D550001

Although this should be easier...

The iSCSI lun contains a kernel zone image just after it was installed. However now I couldn't boot into it:

# zoneadm -z kz1 attach
zone 'kz1': error: Encryption key not available. See solaris-kz(5) for configuration migration
zone 'kz1': procedure or restore /etc/zones/keys/kz1.

Right, the man page explains it all - there is a host meta data needed to boot a kernel zone, which is encrypted. Since I rolled back the ZFS volume to a previous installation the encryption key stored in the zone's configuration is no longer valid. I had to re-create it:

# zoneadm -z kz1 attach -x initialize-hostdata
# zoneadm -z kz1 boot

And now it booted just fine.

Tuesday, October 28, 2014

Oracle Open World 2014

This was my first time attending OOW and I must say I did like it. What a big marketing extravaganza...! It wasn't all marketing though, there were many interesting technical sessions too, but most importantly it was a big opportunity to meet with Oracle engineers and management, both the people I met in the past and some new faces I only exchanged emails with. It was also interesting to talk to other customers and see what they've been up to, and some of them are doing very interesting things. On Sunday there was also a Solaris Customer Advisory Board meeting which was very interesting.

One of the things that surprised me was how many other vendors were present there, having their stands - virtually everyone, from startups to large tier one vendors. I guess it is a good opportunity for everyone to meet with their customers (and potentially new customers).

I also presented there on how we use Solaris and took part of a Solaris Customer Panel - both were good experiences.

For more details see Markus Flierl post.

Friday, August 29, 2014

How ASLR exposed an old bug

Recently I was investigating an issue where OpenAFS server processes where crashing on start-up if ASLR (Address Space Layout Randomization) is enabled. All of them were crashing in the same place. Initially I enabled ASLR globally and restarted AFS services:
$ sxadm enable -c model=all aslr
$ svcadm restart ms/afs/server
This resulted in core files from all daemons, let's look at one of them:
$ mdb core.fileserver.1407857175
Loading modules: [ libc.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of fileserver (64-bit) from XXXX
file: /usr/afs/bin/fileserver
initial argv: /usr/afs/bin/fileserver -p 123 -pctspare 20 -L -busyat 50 -rxpck 2000 -rxbind
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffffb7a94b20
> ::stack
libc.so.1`memset+0x31c()
afsconf_OpenInternal+0x965()
afsconf_Open+0x125()
main+0x118()
0x44611c()
>
All the other daemons crashed in the same place. Let's take a closer look at the core.
> afsconf_OpenInternal+0x965::dis
afsconf_OpenInternal+0x930: movl $0x0,0xfffffffffffffce4(%rbp)
afsconf_OpenInternal+0x93a: movq 0xfffffffffffffce0(%rbp),%r9
afsconf_OpenInternal+0x941: movq -0x8(%rbp),%r8
afsconf_OpenInternal+0x945: movq %r9,0x18(%r8)
afsconf_OpenInternal+0x949: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x94d: movl $0x0,%eax
afsconf_OpenInternal+0x952: call +0x1619 
afsconf_OpenInternal+0x957: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x95b: movl $0x0,%eax
afsconf_OpenInternal+0x960: call +0x8ecb <_afsconf_loadrealms>
afsconf_OpenInternal+0x965: movl $0x0,-0x24(%rbp)
afsconf_OpenInternal+0x96c: jmp +0x6 
afsconf_OpenInternal+0x96e: nop
afsconf_OpenInternal+0x970: jmp +0x2 
afsconf_OpenInternal+0x972: nop
afsconf_OpenInternal+0x974: movl -0x24(%rbp),%eax
afsconf_OpenInternal+0x977: leave
afsconf_OpenInternal+0x978: ret
0x4e4349: nop
0x4e434c: nop
ParseHostLine: pushq %rbp
>
It actually crashes in _afsconf_LoadRealms(), we need a little bit more debug info:
$ truss -u a.out -u :: -vall /usr/afs/bin/fileserver $args
...
/1: -> _afsconf_LoadRealms(0x831790290, 0x1, 0x3, 0x0, 0x5bbbe8, 0x8317952bc)
/1: -> libc:malloc(0x28, 0x1, 0x3, 0x0, 0x28, 0x8317952bc)
...
/1: <- -="" 0x8317965c0="" libc:malloc=""> libc:memset(0x317965c0, 0x0, 0x28, 0x0, 0x28, 0x8317952bc)
/1: Incurred fault #6, FLTBOUNDS %pc = 0x7FFD55C802CC
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
/1: Received signal #11, SIGSEGV [default]
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
It fails just after first malloc() followed by memset() in _afsconf_LoadRealms(), the corresponding source code is:
local_realms = malloc(sizeof(struct afsconf_realms));
if (!local_realms) {
code = ENOMEM;
goto cleanup;
}
memset(local_realms, 0, sizeof(struct afsconf_realms));
queue_Init(&local_realms->list);
The code looks fine... but notice in the above truss output that memset() is using a different pointer to what malloc() returned. Might be a bug in truss but since this is where it crashes it is probably real. Let's confirm it with other tool and maybe we can also spot some pattern.
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, c62324a50
3 99433 memset:entry 62324a50 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, 10288d120
3 99433 memset:entry 288d120 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, de9479a10
3 99433 memset:entry ffffffffe9479a10 0 40
It looks like the lowest 4 bytes in the pointer returned from malloc() and passed to memset() are always preserved, while the top 4 bytes are mangled. I was curious how it looks like when ASLR is disabled:
$ elfedit -e 'dyn:sunw_aslr disable' /usr/afs/bin/fileserver
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::_afsconf_LoadRealms:return{self->in=0}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

1 99436 malloc:entry 40
1 99438 malloc:return 54, 5bd170
1 99434 memset:entry 5bd170 0 40
... [ it continues as it doesn't crash ]
Now the pointer passed to memset() is the same as the one returned from malloc() - notice however that it is 32bit (all daemons are compiled as 64bit). Let's have a look at the core again where it actually fails:
_afsconf_LoadRealms+0x59: call -0xa7c36 
_afsconf_LoadRealms+0x5e: movl %eax,%eax
_afsconf_LoadRealms+0x60: cltq
_afsconf_LoadRealms+0x62: movq %rax,%r8
_afsconf_LoadRealms+0x65: movq %r8,-0x20(%rbp)
_afsconf_LoadRealms+0x69: movq -0x20(%rbp),%r8
_afsconf_LoadRealms+0x6d: cmpq $0x0,%r8
_afsconf_LoadRealms+0x71: jne +0xd <_afsconf_loadrealms x80="">
_afsconf_LoadRealms+0x73: movl $0xc,-0x18(%rbp)
_afsconf_LoadRealms+0x7a: jmp +0x1f5 <_afsconf_loadrealms x274="">
_afsconf_LoadRealms+0x7f: nop
_afsconf_LoadRealms+0x80: movl $0x28,-0x48(%rbp)
_afsconf_LoadRealms+0x87: movl $0x0,-0x44(%rbp)
_afsconf_LoadRealms+0x8e: movq -0x48(%rbp),%r8
_afsconf_LoadRealms+0x92: movq %r8,%rdx
_afsconf_LoadRealms+0x95: movl $0x0,%esi
_afsconf_LoadRealms+0x9a: movq -0x20(%rbp),%rdi
_afsconf_LoadRealms+0x9e: movl $0x0,%eax
_afsconf_LoadRealms+0xa3: call -0xa7d10 

Bingo! See the movl and cltq instructions just after returning from malloc(). This means that malloc() is returning a 64bit address but compiler thinks it returns a 32bit address, so it clears the top 4 bytes and then expands the pointer back to 64 bits and this is what is being passed to memset(). With ASLR disabled it just happens we get a low address that the lowest 4 bytes are enough to address it so we don't get the issue, with ASLR most of the time we end up with much higher address where you can't just chop of the top four bytes.

Compilers do it if they have an implicit function declaration and then they assume the return is an int which on x86_64 means 32 bits. The fix was trivial - all that was required was to add #include <stdlib.h> and recompile - now compiler knows that malloc() returns 64 bit pointer, the movl, cltq instructions are gone and we get no more crashes.

Friday, August 08, 2014

Monitoring Dropped UDP packets

I needed to monitor incoming UDP packets being dropped by kernel and report some basic information on each of them (src and dst IP:port is enough). There are multiple probes which one could us, particularly ip:::drop-in is the best candidate, however it is not fully implemented so it doesn't provide all the same arguments other probes in IP provider have (this seems easy to fix). We can intercept ip_drop_input() function though (which is where the ip::drop-in probe is located as well) and juggle pointers to get what we want - once the drop-in probe is fixed the below script will be shorter and easier to understand. 
# ./udp_dropped_packets.d
2014 Aug  6 14:32:14 dropped UDP packet [udpIfStatsInOverflows] from 10.10.1.4:52537 to 10.10.2.9:7001
2014 Aug  6 14:32:15 dropped UDP packet [udpIfStatsInOverflows] from 10.10.1.4:52537 to 10.10.2.9:7001
2014 Aug  6 14:32:17 dropped UDP packet [l2 multicast not forwarded] from 10.10.3.11:137 to 10.10.3.9:137

Notice that the script prints reason why a packet was dropped, if you are interested only in drops due to lack of buffer space then you need to grep the output for udpIfStatsInOverflows.

The script has only been tested on Solaris 11.1 and 11.2 but it probably works on 11.0 and Illumos based distros as well.

# cat udp_dropped_packets.d
#!/usr/sbin/dtrace -qa

/* Robert Milkowski */

fbt::ip_drop_input:entry
/ ((ipha_t *)args[1]->b_rptr)->ipha_protocol == IPPROTO_UDP /
{
  this->ipha = (ipha_t *)args[1]->b_rptr;
  this->ip_src = inet_ntoa(&this->ipha->ipha_src);
  this->ip_dst = inet_ntoa(&this->ipha->ipha_dst);

  this->iph_hdr_length = (int)(this->ipha->ipha_version_and_hdr_length & 0xF) << 2;
  this->udpha = (udpha_t *)((char *)this->ipha + this->iph_hdr_length);
  this->src_port = ntohs(this->udpha->uha_src_port);
  this->dst_port = ntohs(this->udpha->uha_dst_port);

  printf("%Y dropped UDP packed [%s] from %s:%d to %s:%d\n", walltimestamp,
          stringof arg0,
          this->ip_src, this->src_port,
          this->ip_dst, this->dst_port
        );
}

Tuesday, July 22, 2014

Massive msync() speed up on ZFS

MongoDB is using mmap() to access all its files. It also has a special thread which by default wakes up every 60s and calls msync() for all the mmap'ed files, one at a time. Initially when you start MongoDB all these msync's are fast, assuming there are no modifications. However if your server has hundreds of GBs of RAM (or more) and a database is also that big, all these msync()s are getting slower over time - they take longer the more data is cached in RAM. Eventually it can take even 50s or more for the thread to finish syncing all of the files even if there is nothing to write to disk. The actual problem is that while the thread is syncing all the files it holds a global lock and until it finishes the database is almost useless. If it takes 50s to sync all the files then the database can process requests only for 10s out of each 60s window...

If you have logging enabled in MongoDB you should see log entries like:

    Tue Apr 29 06:22:02.901 [DataFileSync] flushing mmaps took 33302ms for 376 files

On Linux this is much faster as it has a special optimization for such a case, which Solaris doesn't.
However Oracle fixed the bug sometime ago and now the same database reports:

    Tue Apr 29 12:55:51.009 [DataFileSync] flushing mmaps took 9ms for 378 files

This is over 3000x improvement!
 
The Solaris Bug ID is: 18658199 Speed up msync() on ZFS by 90000x with this one weird trick
which is fixed in Solaris 11.1 SRU21 and also Solaris 11.2

Note that the fix only improves a case when an entire file is msync'ed and the underlying file system is ZFS. Any application which has a similar behavior would benefit. For some large mappings, like 1TB, the improvement can even be 178,000x.


Wednesday, April 09, 2014

Slow process core dump

Solaris limits rate at which a process core dump is generated. There are two tunables with default values: core_delay_usec=10000 core_chunk=32 (in pages). This essentially means that process core dump is limited to 100 writes of 128KB per second (32x 4k pages on x86) which is an eqivalen of 12.5MB/s - for large processes this will take quite a long time.

I set core_chunk=128 and core_delay_usec=0 via mdb which increased the core dump write rate from 12.5MB/s to about 300-600MB/s.

Long core dump generations can cause issues like delay in restarting a service in SMF. But then making it too fast might overwhelm OS and impact other running processes there. Still, I think the defaults are probably too conservative.

update: I just noticed that this has been fixed in Illumos, see https://illumos.org/issues/3673
 

Thursday, April 03, 2014

Home NAS - Data Corruption

I value my personal pictures so I store them on a home NAS which is running Solaris/ZFS. The server has two data disks mirrored. It's been running perfectly fine for the last two years but recently one of the disk drives returned corrupted data during ZFS scrub. If it wasn't for ZFS I probably wouldn't even know that some pixels are wrong... or perhaps that some pictures are missing... This is exactly why I've used ZFS on my home NAS.

# zpool status vault
  pool: vault
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 256K in 3h54m with 0 errors on Tue Mar 11 17:34:22 2014
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c5t0d0  ONLINE       0     0     0

errors: No known data errors

Friday, January 31, 2014

SMF: svcbundle

One of the missing things in SMF for me is an ability to mark a service to be disabled or enabled after reboot during postinstall phase of OS installation. One way of doing it is to generate a site profile, but this is rather cumbersome to do from a postinstaller. But there is a better way - svcbundle(1M). For example, to generate a profile which would enable ssh after reboot (or after manifest-import service is restarted:
svcbundle -s bundle-type=profile -s service-name=network/ssh \
          -s enabled=true >/etc/svc/profile/site/postinstall-ssh.xml
You can modify any parameter you want, for example to modify a sendmail service to use /etc/mail/cf/cf/nullclient.mc to generate its configuration from each time it starts:
svcbundle -s bundle-type=profile -s service-name=network/smtp -s instance-name=sendmail \
          -s instance-property-config:patht_to_sendmail_mc:astring:/etc/mail/cf/cf/nullclient.mc \
          >/etc/svc/profile/site/postinstall-sendmail-nullclient.xml

Thursday, January 23, 2014

mkdir() performance

Update: the fix is in Solaris 11.1 + SRU17, and should be in Solaris 11.2 once it is out. It now has a similar optimization to Linux. Network based file systems like AFS or NFS benefit most from it.

Recently I came across an issue where 'make install' on a Solaris server was taking *much* more time than on Linux. Files were being installed into AFS file system. After some debugging I found that GNU install calls mkdir() for all directories for a specified path and relies on EEXIST if a given directory already exists. For example: 

$ truss -D -t mkdir /usr/bin/ginstall -c -d \
 /ms/dev/openafs/core/1.6.5-c3/compile/x86_64.sunos64.5.11/sunx86_511/dest/bin
0.0026 mkdir("/ms", 0755)                              Err#17 EEXIST
0.0003 mkdir("dev", 0755)                              Err#30 EROFS
0.0002 mkdir("openafs", 0755)                          Err#30 EROFS
0.0002 mkdir("core", 0755)                             Err#30 EROFS
0.0083 mkdir("1.6.5-c3", 0755)                         Err#17 EEXIST
3.0085 mkdir("compile", 0755)                          Err#17 EEXIST
3.0089 mkdir("x86_64.sunos64.5.11", 0755)              Err#17 EEXIST
0.0005 mkdir("sunx86_511", 0755)                       Err#17 EEXIST
0.0002 mkdir("dest", 0755)                             Err#17 EEXIST
0.0065 mkdir("bin", 0755)                              Err#17 EEXIST
$

Notice that two mkdir()s took about 3s each! Now if there are lots of directories to be ginstall'ed it will take a very long time... I couldn't reproduce it on Linux though. Actually what happens is that Linux checks on VFS layer if there is a valid dentry with an inode allocated and if there is it will return with EEXIST without calling a file system specific VOP_MKDIR. The relevant code is:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/namei.c
…
int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
{
        int error = may_create(dir, dentry);
        unsigned max_links = dir->i_sb->s_max_links;

        if (error)
               return error;

        if (!dir->i_op->mkdir)
               return -EPERM;
…
static inline int may_create(struct inode *dir, struct dentry *child)
{
        audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
        if (child->d_inode)
               return -EEXIST;
…

Unfortunately, Solaris doesn't have this optimization (though it does optimize couple of other cases, for example for EROFS), so each mkdir() results in VOP_MKDIR being called and for AFS it means sending a request over a network to a file server and waiting for a reply. That alone will make it slower than on Linux, but it still doesn't explain the 3s.

It turned out that AFS file server has a throttling mechanism - if a client is generating requests which results in error then by default it will delay answering to the client after 10 errors. This can be disabled or the threshold can be adjusted though. See -abortthreshold option to file server.

This was also tested (by an Oracle Solaris engineer) over NFS and showed 100x difference in response time. There is a negligible difference for local file systems.

A bug was opened against Solaris to get it fixed - see Bug 18115102 - mkdir(2) calls VOP_MKDIR() even though the directory or file already exists

Hopefully it will get fixed soon.