# truss -f -o /tmp/a -v all -adDE -p 408Now I logged in to the system and analyzed /tmp/a file. First I confirmed that it took over 10s to login. From the moment the connection was accepted to the moment I got interactive session it took about 11s as shown below:
[...]So I checked when it started to go wrong.
408: 2.6594 0.0007 0.0000 fcntl(4, F_SETFL, (no flags)) = 0
[...]
12186: 14.0814 0.0001 0.0000 write(4, " | : { b7F S LB7A2 BA13".., 64) = 64
12196: read(0, 0x080473DF, 1) (sleeping...)
[...]
[...]The connection started just before the fcntl showed above and everything is executing quick up-to forkx() at 3.7245s. So far it took a little more than 1s. What happens next seems to be a loop of hundreds of entries like:
408: 2.6594 0.0007 0.0000 fcntl(4, F_SETFL, (no flags)) = 0
[...]
12196: 3.7245 0.0003 0.0003 forkx(0) = 12200
[...]The process with PID 12200 was:
12200: 4.5521 0.0000 0.0000 ioctl(3, ZFS_IOC_USERSPACE_ONE, 0x08046790) Err#48 ENOTSUP
12200: 4.5522 0.0001 0.0000 ioctl(7, MNTIOC_GETMNTENT, 0x08047C1C) = 0
12200: 4.5917 0.0395 0.0002 ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390) = 0
12200: 4.5918 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.5919 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.5919 0.0000 0.0000 door_info(6, 0x08046460) = 0
12200: target=189 proc=0x806FCD0 data=0xDEADBEED
12200: attributes=DOOR_UNREF|DOOR_NO_CANCEL
12200: uniquifier=289
12200: 4.5922 0.0003 0.0000 door_call(6, 0x080464D0) = 0
12200: data_ptr=FE430000 data_size=232
12200: desc_ptr=0x0 desc_num=0
12200: rbuf=0xFE430000 rsize=16384
12200: 4.5923 0.0001 0.0000 ioctl(3, ZFS_IOC_USERSPACE_ONE, 0x08046790) Err#48 ENOTSUP
12200: 4.5923 0.0000 0.0000 ioctl(7, MNTIOC_GETMNTENT, 0x08047C1C) = 0
12200: 4.6095 0.0172 0.0001 ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390) = 0
12200: 4.6096 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.6096 0.0000 0.0000 getuid() = 35148 [35148]
12200: 4.6097 0.0001 0.0000 door_info(6, 0x08046460) = 0
12200: target=189 proc=0x806FCD0 data=0xDEADBEED
12200: attributes=DOOR_UNREF|DOOR_NO_CANCEL
12200: uniquifier=289
12200: 4.6098 0.0001 0.0000 door_call(6, 0x080464D0) = 0
12200: data_ptr=FE430000 data_size=232
12200: desc_ptr=0x0 desc_num=0
12200: rbuf=0xFE430000 rsize=16384
12200: 4.6098 0.0000 0.0000 ioctl(3, ZFS_IOC_USERSPACE_ONE, 0x08046790) Err#48 ENOTSUP
12200: 4.6099 0.0001 0.0000 ioctl(7, MNTIOC_GETMNTENT, 0x08047C1C) = 0
12200: 4.6201 0.0102 0.0001 ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390) = 0
12200: 4.6202 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.6203 0.0001 0.0000 getuid() = 35148 [35148]
[...]
12200: 3.8229 0.0947 0.0013 execve("/usr/sbin/quota", 0x0811F9E8, 0x0811D008) argc = 1By visually looking at couple of pages of these ioctls it looked like most of the total time would be spent in doing ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390). Lets check it:
12200: *** SUID: ruid/euid/suid = 35148 / 0 / 0 ***
12200: argv: /usr/sbin/quota
# grep "^12200:" /tmp/a |grep ioctl|grep ZFS_IOC_OBJSET_STATS|awk 'BEGIN{i=0}{i=i+$3}END{print i}'So out of 11s above ioctls along took 9. To have a clear picture lets check how much time the quota command took:
9.7412
# grep "^12200:" /tmp/a |head -1So it took about 10s which means almost 100% of its time was spent doing above ioctls.
12200: 3.7245 3.7245 0.0000 forkx() (returning as child ...) = 12196
# grep "^12200:" /tmp/a |tail -1
12200: 13.9854 0.0003 0.0000 _exit(0)
There are almost 300 zfs filesystems on this particular server so it all adds up. Sometimes quota completes very quickly sometimes it takes many seconds - I guess depending if requested data from all these zfs filesystems is cached or not. You need to run quota as a non-root user otherwise most checks are skipped and it is always quick.
Since we are not using quota on these systems anyway I commented out quota check in /etc/profile and now a full login takes about 1s on average which is 10-12x improvement.
I normally remove quota (and mail) checks on systems that I manage, as they're invariably useless. And if I can't, then touching ~/.hushlogin can help a lot too.
ReplyDeleteOne culprit I've seen in the last week was a transcontinental NFS mount - checking quota on that at every login was just silly.
Great investigation.
ReplyDeleteI recently ran into this as well. Is this check independent of ZFS quotas?
Dru R - yes, it is independent. It is run from /etc/profile and basically checks all filesystems (ufs, nfs, zfs) for a user quota.
ReplyDeletePeter - yeah, you can comment the quota out in /etc/profile so it is disabled for all users or create a .hushlogin for each of them