Recently I was investigating an issue where OpenAFS server processes were crashing on start-up when ASLR (Address Space Layout Randomization) was enabled. All of them were crashing in the same place.
Initially I enabled ASLR globally and restarted AFS services:
$ sxadm enable -c model=all aslr
$ svcadm restart ms/afs/server
This resulted in core files from all daemons. Let's look at one of them:
$ mdb core.fileserver.1407857175
Loading modules: [ libc.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of fileserver (64-bit) from XXXX
file: /usr/afs/bin/fileserver
initial argv: /usr/afs/bin/fileserver -p 123 -pctspare 20 -L -busyat 50 -rxpck 2000 -rxbind
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffffb7a94b20
> ::stack
libc.so.1`memset+0x31c()
afsconf_OpenInternal+0x965()
afsconf_Open+0x125()
main+0x118()
0x44611c()
>
All the other daemons crashed in the same place.
Let's take a closer look at the core.
> afsconf_OpenInternal+0x965::dis
afsconf_OpenInternal+0x930: movl $0x0,0xfffffffffffffce4(%rbp)
afsconf_OpenInternal+0x93a: movq 0xfffffffffffffce0(%rbp),%r9
afsconf_OpenInternal+0x941: movq -0x8(%rbp),%r8
afsconf_OpenInternal+0x945: movq %r9,0x18(%r8)
afsconf_OpenInternal+0x949: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x94d: movl $0x0,%eax
afsconf_OpenInternal+0x952: call +0x1619
afsconf_OpenInternal+0x957: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x95b: movl $0x0,%eax
afsconf_OpenInternal+0x960: call +0x8ecb <_afsconf_LoadRealms>
afsconf_OpenInternal+0x965: movl $0x0,-0x24(%rbp)
afsconf_OpenInternal+0x96c: jmp +0x6
afsconf_OpenInternal+0x96e: nop
afsconf_OpenInternal+0x970: jmp +0x2
afsconf_OpenInternal+0x972: nop
afsconf_OpenInternal+0x974: movl -0x24(%rbp),%eax
afsconf_OpenInternal+0x977: leave
afsconf_OpenInternal+0x978: ret
0x4e4349: nop
0x4e434c: nop
ParseHostLine: pushq %rbp
>
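The return address afsconf_OpenInternal+0x965 is the instruction right after the call to _afsconf_LoadRealms(), so the crash actually happens inside _afsconf_LoadRealms(). We need a little more debug info: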
$ truss -u a.out -u :: -vall /usr/afs/bin/fileserver $args
...
/1: -> _afsconf_LoadRealms(0x831790290, 0x1, 0x3, 0x0, 0x5bbbe8, 0x8317952bc)
/1: -> libc:malloc(0x28, 0x1, 0x3, 0x0, 0x28, 0x8317952bc)
...
/1: <- libc:malloc() = 0x8317965c0
/1: -> libc:memset(0x317965c0, 0x0, 0x28, 0x0, 0x28, 0x8317952bc)
/1: Incurred fault #6, FLTBOUNDS %pc = 0x7FFD55C802CC
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
/1: Received signal #11, SIGSEGV [default]
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
It fails right after the first malloc(), in the memset() that follows it in _afsconf_LoadRealms(). The corresponding source code is:
local_realms = malloc(sizeof(struct afsconf_realms));
if (!local_realms) {
    code = ENOMEM;
    goto cleanup;
}
memset(local_realms, 0, sizeof(struct afsconf_realms));
queue_Init(&local_realms->list);
The code looks fine... but notice in the truss output above that memset() is given a different pointer from the one malloc() returned. It might be a bug in truss, but since this is exactly where it crashes, it is probably real. Let's confirm it with another tool, and maybe we can also spot a pattern.
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"
CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, c62324a50
3 99433 memset:entry 62324a50 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"
CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, 10288d120
3 99433 memset:entry 288d120 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"
CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, de9479a10
3 99433 memset:entry ffffffffe9479a10 0 40
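Comparing the value malloc() returned with the value memset() received is easier if each address is split into 32-bit halves. The following is only a minimal sketch for doing that comparison; the helper names hi32, lo32 and only_high_half_differs are made up for this illustration and are not part of OpenAFS or any library.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit address value (hypothetical helper). */
static uint32_t hi32(uint64_t addr)
{
    return (uint32_t)(addr >> 32);
}

/* Lower 32 bits of a 64-bit address value (hypothetical helper). */
static uint32_t lo32(uint64_t addr)
{
    return (uint32_t)(addr & 0xffffffffu);
}

/*
 * Returns 1 if the two addresses agree in their low 32 bits
 * but differ in their high 32 bits, 0 otherwise.
 */
static int only_high_half_differs(uint64_t returned, uint64_t passed)
{
    return lo32(returned) == lo32(passed) && hi32(returned) != hi32(passed);
}
```

Calling it with a pair taken from one of the runs, for example only_high_half_differs(0xde9479a10ULL, 0xffffffffe9479a10ULL), reports whether only the upper half of the pointer changed between malloc() and memset().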
It looks like the lowest 4 bytes of the pointer returned from malloc() are always preserved when it is passed to memset(), while the top 4 bytes are mangled. I was curious what it looks like when ASLR is disabled:
$ elfedit -e 'dyn:sunw_aslr disable' /usr/afs/bin/fileserver
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::_afsconf_LoadRealms:return{self->in=0}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"
1 99436 malloc:entry 40
1 99438 malloc:return 54, 5bd170
1 99434 memset:entry 5bd170 0 40
... [ it continues as it doesn't crash ]
Now the pointer passed to memset() is the same as the one returned from malloc(). Notice, however, that it fits in 32 bits (all daemons are compiled as 64-bit). Let's have another look at the core, at the point where it actually fails:
_afsconf_LoadRealms+0x59: call -0xa7c36
_afsconf_LoadRealms+0x5e: movl %eax,%eax
_afsconf_LoadRealms+0x60: cltq
_afsconf_LoadRealms+0x62: movq %rax,%r8
_afsconf_LoadRealms+0x65: movq %r8,-0x20(%rbp)
_afsconf_LoadRealms+0x69: movq -0x20(%rbp),%r8
_afsconf_LoadRealms+0x6d: cmpq $0x0,%r8
_afsconf_LoadRealms+0x71: jne +0xd <_afsconf_LoadRealms+0x80>
_afsconf_LoadRealms+0x73: movl $0xc,-0x18(%rbp)
_afsconf_LoadRealms+0x7a: jmp +0x1f5 <_afsconf_LoadRealms+0x274>
_afsconf_LoadRealms+0x7f: nop
_afsconf_LoadRealms+0x80: movl $0x28,-0x48(%rbp)
_afsconf_LoadRealms+0x87: movl $0x0,-0x44(%rbp)
_afsconf_LoadRealms+0x8e: movq -0x48(%rbp),%r8
_afsconf_LoadRealms+0x92: movq %r8,%rdx
_afsconf_LoadRealms+0x95: movl $0x0,%esi
_afsconf_LoadRealms+0x9a: movq -0x20(%rbp),%rdi
_afsconf_LoadRealms+0x9e: movl $0x0,%eax
_afsconf_LoadRealms+0xa3: call -0xa7d10
Bingo! See the movl and cltq instructions right after the return from malloc(). malloc() returns a 64-bit address, but the compiler thinks it returns a 32-bit int, so it discards the top 4 bytes and then sign-extends the remaining 32 bits back to 64 bits, and that is what gets passed to memset(). With ASLR disabled we happen to get a low address that fits in the lowest 4 bytes, so nothing goes wrong. With ASLR enabled we usually end up with a much higher address, where you can't just chop off the top four bytes.
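The effect of that movl/cltq pair can be reproduced in plain integer arithmetic. This is a sketch of the transformation only, not code from OpenAFS; the function name as_if_returned_int is made up for the illustration.

```c
#include <stdint.h>

/*
 * Model of the two instructions seen after the call to malloc()
 * (hypothetical helper, for illustration only):
 *   movl %eax,%eax  keeps only the low 32 bits of the return value
 *   cltq            sign-extends those 32 bits back to 64 bits
 */
static uint64_t as_if_returned_int(uint64_t real_ptr)
{
    /* movl %eax,%eax: drop the upper half of the register */
    uint64_t value = real_ptr & 0xffffffffULL;

    /* cltq: copy bit 31 into bits 32..63 */
    if (value & 0x80000000ULL)
        value |= 0xffffffff00000000ULL;

    return value;
}
```

Applied to the addresses from the dtrace runs, as_if_returned_int(0xde9479a10ULL) gives 0xffffffffe9479a10 and as_if_returned_int(0xc62324a50ULL) gives 0x62324a50, which is what memset() received. The ASLR-disabled address 0x5bd170 passes through unchanged, which is why that case did not crash.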
Compilers do this when a function is only implicitly declared: they assume it returns an int, which on x86_64 is 32 bits. The fix was trivial: all that was required was to add #include <stdlib.h> and recompile. Now the compiler knows that malloc() returns a 64-bit pointer, the movl and cltq instructions are gone, and there are no more crashes.
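For reference, here is a minimal sketch of the same allocate-and-zero pattern with the required headers in place. The struct demo_realms and the function demo_realms_new are hypothetical stand-ins, since the real struct afsconf_realms layout is not shown here.

```c
#include <stdlib.h>  /* declares malloc(); without it, older C compilers assume it returns int */
#include <string.h>  /* declares memset() */

/* Hypothetical stand-in for struct afsconf_realms. */
struct demo_realms {
    void *head;
    void *tail;
    int count;
};

/* Allocate and zero one structure; returns NULL if malloc() fails. */
static struct demo_realms *demo_realms_new(void)
{
    struct demo_realms *r = malloc(sizeof(*r));

    if (r == NULL)
        return NULL;
    memset(r, 0, sizeof(*r));
    return r;
}
```

To catch a missing prototype at build time rather than in a core file, GCC and Clang accept -Werror=implicit-function-declaration. Whether the compiler used for this particular build offers an equivalent option is not something I checked.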