Another blog entry saved as draft for a long time and waiting to be published... better late than never, so let's do it now.
Over three years ago I found a bug in the Linux NFS v4.0 client which results in applications temporarily being unable to open a file and getting an error instead. While the bug is in the NFS v4.0 client code, whether you hit it depends on the NFS server you are talking to: for example, both Linux and Solaris NFS servers will trigger the bug, while NetApp/ONTAP won't.
Once I debugged and understood the problem, I was able to reproduce it. I came up with a temporary workaround and later submitted two patches to the Linux kernel, which were quickly integrated into Linux 5.6 and backported to some older versions (see the patches at the very end of this blog entry).
Let's look in more detail at the issue.
btw: some output has been abbreviated and/or modified to remove irrelevant details or to anonymize details.
An application written in Java was sometimes failing to open a file with the below exception:
Developers added retry logic which helped: usually after a few retries the application was able to open the affected file fine. This was happening to different files, at different times, on multiple servers, even to files which were created a long time ago (don't get fooled by the java.io.FileNotFoundException).
To understand how a file is accessed by the application (although some other applications were affected too), I wrote the below code based on the affected application.
Let's run it under strace and see what system calls it makes:
It is just a simple open() with O_RDONLY followed by read() and close(). Nothing suspicious. The error is happening in open() though, based on the Java stack trace in the application log.
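For illustration only, the same access pattern boils down to something like the minimal C program below; this is just a sketch of the syscall sequence seen in strace, not the actual Java test code.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[64 * 1024];
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* This is the call that intermittently fails with EIO. */
    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Read through the whole file, then close it. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;
    if (n == -1)
        perror("read");

    close(fd);
    return 0;
}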
I did run the test program multiple times on affected servers and tried to access the affected files, but couldn't trigger the issue.
We modified one of the affected applications and added retry logic for failed file opens; we also ran part of the application under strace to see what the actual error reported by open() was. Let's see an example output:
The open() failed with EIO twice and then succeeded on the third attempt.
Given these are NFSv4 mount points mounted with the "hard" option, we shouldn't be getting EIO at all.
When the issue happens, no errors are reported in system logs.
I also collected network traffic for the above failures.
Looks like the NFS session/lease is expiring and being renewed asynchronously, with operations failing in the meantime.
We caught some errors:
Let's look at one java thread:
So within the same second it stat'ed the file multiple times just fine (1x lstat() + 3x stat()) only to get EIO on open().
Let's look at tcpdump capture:
# tcpdump -n -r /var/tmp/tcpdump.XXX.out
...
17:11:02.766890 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75637865:75638065, ack 2022930725, win 12288, options [nop,nop,TS val 12657161 ecr 997162631], length 200: NFS request xid 2489184212 196 getattr fh 0,0/22
17:11:02.767043 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022930725:2022931065, ack 75638065, win 32806, options [nop,nop,TS val 997162631 ecr 12657161], length 340: NFS reply xid 2489184212 reply ok 336 getattr NON 4 ids 0/15 sz 0
17:11:02.769335 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638065:75638329, ack 2022931065, win 12288, options [nop,nop,TS val 12657164 ecr 997162631], length 264: NFS request xid 2505961428 260 getattr fh 0,0/22
17:11:02.769477 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931065:2022931149, ack 75638329, win 32806, options [nop,nop,TS val 997162631 ecr 12657164], length 84: NFS reply xid 2505961428 reply ok 80 getattr ERROR: unk 10011
17:11:02.769544 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638329:75638593, ack 2022931149, win 12288, options [nop,nop,TS val 12657164 ecr 997162631], length 264: NFS request xid 2522738644 260 getattr fh 0,0/22
17:11:02.769638 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931149:2022931233, ack 75638593, win 32806, options [nop,nop,TS val 997162631 ecr 12657164], length 84: NFS reply xid 2522738644 reply ok 80 getattr ERROR: unk 10011
17:11:02.809052 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [.], ack 2022931233, win 12288, options [nop,nop,TS val 12657204 ecr 997162631], length 0
17:11:02.821772 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638593:75638789, ack 2022931233, win 12288, options [nop,nop,TS val 12657216 ecr 997162631], length 196: NFS request xid 2539515860 192 getattr fh 0,0/22
17:11:02.822228 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931233:2022931573, ack 75638789, win 32806, options [nop,nop,TS val 997162636 ecr 12657216], length 340: NFS reply xid 2539515860 reply ok 336 getattr NON 4 ids 0/15 sz 0
17:11:02.822261 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [.], ack 2022931573, win 12288, options [nop,nop,TS val 12657217 ecr 997162636], length 0
17:11:02.823327 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638789:75638957, ack 2022931573, win 12288, options [nop,nop,TS val 12657218 ecr 997162636], length 168: NFS request xid 2556293076 164 getattr fh 0,0/22
17:11:02.823436 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931573:2022931857, ack 75638957, win 32806, options [nop,nop,TS val 997162636 ecr 12657218], length 284: NFS reply xid 2556293076 reply ok 280 getattr NON 2 ids 0/9 sz 0
...
The failed open() calls correlate with the "getattr ERROR: unk 10011" replies.
Let's also look at tcpdump capture for YYY:
# tcpdump -n -r /var/tmp/tcpdump.YYY.out
...
17:10:08.447238 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377189, win 11471, options [nop,nop,TS val 12602842 ecr 997157199], length 0
17:11:08.645057 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377189, win 11471, options [nop,nop,TS val 12663040 ecr 997157199], length 0
17:11:08.645147 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [.], ack 314027332, win 32806, options [nop,nop,TS val 997163219 ecr 12602842], length 0
17:11:08.645173 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027332:314027492, ack 174377189, win 11471, options [nop,nop,TS val 12663040 ecr 997163219], length 160: NFS request xid 398604749 156 getattr fh 0,28/0
17:11:08.645249 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [.], ack 314027492, win 32806, options [nop,nop,TS val 997163219 ecr 12663040], length 0
17:11:08.692687 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377189:174377305, ack 314027492, win 32806, options [nop,nop,TS val 997163223 ecr 12663040], length 116: NFS reply xid 398604749 reply ok 112 getattr ERROR: unk 24
17:11:08.692796 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027492:314027652, ack 174377305, win 11471, options [nop,nop,TS val 12663087 ecr 997163223], length 160: NFS request xid 415381965 156 getattr fh 0,28/0
17:11:08.693123 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377305:174377421, ack 314027652, win 32806, options [nop,nop,TS val 997163223 ecr 12663087], length 116: NFS reply xid 415381965 reply ok 112 getattr ERROR: unk 24
17:11:08.693164 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027652:314027916, ack 174377421, win 11471, options [nop,nop,TS val 12663088 ecr 997163223], length 264: NFS request xid 432159181 260 getattr fh 0,132/0
17:11:08.693592 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377421:174377553, ack 314027916, win 32806, options [nop,nop,TS val 997163223 ecr 12663088], length 132: NFS reply xid 432159181 reply ok 128 getattr ERROR: unk 40
17:11:08.693680 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027916:314028084, ack 174377553, win 11471, options [nop,nop,TS val 12663088 ecr 997163223], length 168: NFS request xid 448936397 164 getattr fh 0,36/0
17:11:08.701988 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377553:174377669, ack 314028084, win 32806, options [nop,nop,TS val 997163224 ecr 12663088], length 116: NFS reply xid 448936397 reply ok 112 getattr ERROR: unk 24
17:11:08.702090 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [S], seq 1851114974, win 32804, options [mss 1460,sackOK,TS val 997163224 ecr 0,nop,wscale 5], length 0
17:11:08.702122 IP 10.210.16.12.33663 > 10.210.6.5.63021: Flags [S.], seq 2641182587, ack 1851114975, win 28960, options [mss 1460,sackOK,TS val 12663097 ecr 997163224,nop,wscale 8], length 0
17:11:08.702206 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [.], ack 1, win 32806, options [nop,nop,TS val 997163224 ecr 12663097], length 0
17:11:08.704470 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [P.], seq 1:77, ack 1, win 32806, options [nop,nop,TS val 997163225 ecr 12663097], length 76
17:11:08.704487 IP 10.210.16.12.33663 > 10.210.6.5.63021: Flags [.], ack 77, win 114, options [nop,nop,TS val 12663099 ecr 997163225], length 0
17:11:08.704534 IP 10.210.16.12.33663 > 10.210.6.5.63021: Flags [P.], seq 1:29, ack 77, win 114, options [nop,nop,TS val 12663099 ecr 997163225], length 28
17:11:08.704620 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [.], ack 29, win 32806, options [nop,nop,TS val 997163225 ecr 12663099], length 0
17:11:08.741063 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377669, win 11471, options [nop,nop,TS val 12663136 ecr 997163224], length 0
17:11:13.717133 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314028084:314028244, ack 174377669, win 11471, options [nop,nop,TS val 12668112 ecr 997163224], length 160: NFS request xid 465713613 156 getattr fh 0,28/0
17:11:13.717704 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377669:174377785, ack 314028244, win 32806, options [nop,nop,TS val 997163726 ecr 12668112], length 116: NFS reply xid 465713613 reply ok 112 getattr ERROR: unk 24
17:11:13.717715 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377785, win 11471, options [nop,nop,TS val 12668112 ecr 997163726], length 0
17:12:02.981295 IP 10.210.6.5.57869 > 10.210.16.12.33663: Flags [S], seq 2961022330, win 32804, options [mss 1460,sackOK,TS val 997168652 ecr 0,nop,wscale 5], length 0
...
A few seconds later we see "getattr ERROR: unk 24", and a corresponding error in the application error log.
Nothing interesting in the system logs on the NFS server. However, on the NFS server I was also running:
Let's see the log around Nov 12 17:11:02:
Notice the op-renew-done errors.
Let's see what these are:
The op-renew-done failure with 10011 (NFS4ERR_EXPIRED) matches the time when the application failed.
The other interesting thing is that the Linux NFS client has gssd running with -t 1800, so it needs to renew credentials every 30 minutes.
The interesting entries here are:
The interval between them is about 39s longer than 30 minutes. I didn't catch the successful op-renew-done here, though.
After some further debugging let's look at the original network dump again.
It calls readdir, access, etc. to get to the final directory and then tries to open the file; the server responds with NFS4ERR_EXPIRED, the client tries again, gets the error again, and it seems to give up and move on to other files, where the story repeats for each one of them. All directory lookups, readdirs, etc. work fine though. Here is just the open, its retry and the errors for the first file that failed:
After multiple open() calls have failed, the NFS client eventually does:
which fixes the issue, and a moment later it can open files fine:
The last time the client issued a RENEW operation before the first failure was:
Notice that it had been issuing RENEW every 60s (the file server sets the lease period to 90s), and then it didn't do so for almost 166s, which is more than the 90s lease requested by the server. For sanity let's check when the first NFS4ERR_EXPIRED error happened. The last RENEW before the issue was at 17:09:17, so unless the lease was renewed (it wasn't) the server should start responding with errors from about 17:10:47. The first NFS4ERR_EXPIRED is recorded at 17:11:02, which is the first OPEN call happening during that time frame.
Looks like a bug in the nfs client.
Let's run some additional debugging.
Network packet capture:
Let's trace all checks for NFSv4 lease expiry on the NFS server and record all cases where a lease did expire:
Let's record all NFS RENEW calls:
Let's catch all NFSv4 operations and their statuses on the NFS server:
# cat /var/tmp/ss
#!/bin/bash
dtrace -q -n 'nfsv4:::op-open-downgrade-start,nfsv4:::op-secinfo-start,nfsv4:::op-readdir-start,nfsv4:::op-setclientid-start,
nfsv4:::op-remove-start,nfsv4:::cb-recall-start,nfsv4:::op-rename-start,nfsv4:::op-commit-start,nfsv4:::op-savefh-start,
nfsv4:::op-delegpurge-start,nfsv4:::op-openattr-start,nfsv4:::op-delegreturn-start,nfsv4:::op-putpubfh-start,
nfsv4:::op-nverify-start,nfsv4:::op-verify-start,nfsv4:::op-setattr-start,nfsv4:::op-getattr-start,
nfsv4:::op-open-confirm-start,nfsv4:::null-start,nfsv4:::op-readlink-start,nfsv4:::op-lockt-start,nfsv4:::op-locku-start,
nfsv4:::op-putrootfh-start,nfsv4:::op-close-start,nfsv4:::op-access-start,nfsv4:::op-getfh-start,nfsv4:::op-read-start,
nfsv4:::op-lock-start,nfsv4:::op-open-start,nfsv4:::op-link-start,nfsv4:::op-renew-start,nfsv4:::op-write-start,
nfsv4:::op-restorefh-start,nfsv4:::op-lookupp-start,nfsv4:::op-release-lockowner-start,nfsv4:::compound-start,
nfsv4:::op-putfh-start,nfsv4:::op-setclientid-confirm-start,nfsv4:::op-lookup-start,nfsv4:::op-create-start
/args[0]->ci_remote=="10.210.16.12"/
{self->t=timestamp;
printf("%Y %s %s %s\n", walltimestamp, probename, args[0]->ci_remote, args[0]->ci_local);}' \
-n 'nfsv4:::op-open-downgrade-done,nfsv4:::op-readdir-done,nfsv4:::op-access-done,nfsv4:::op-lock-done,nfsv4:::op-link-done,
nfsv4:::op-setclientid-done,nfsv4:::op-open-confirm-done,nfsv4:::op-open-done,nfsv4:::op-create-done,
nfsv4:::op-secinfo-done,nfsv4:::op-locku-done,nfsv4:::op-close-done,nfsv4:::op-write-done,
nfsv4:::op-delegreturn-done,nfsv4:::op-getattr-done,nfsv4:::op-verify-done,nfsv4:::op-getfh-done,
nfsv4:::op-putfh-done,nfsv4:::op-nverify-done,nfsv4:::op-readlink-done,nfsv4:::op-renew-done,
nfsv4:::op-savefh-done,nfsv4:::op-read-done,nfsv4:::compound-done,nfsv4:::cb-recall-done,
nfsv4:::op-lookup-done,nfsv4:::op-restorefh-done,nfsv4:::op-release-lockowner-done,nfsv4:::op-lookupp-done,
nfsv4:::op-openattr-done,nfsv4:::op-putpubfh-done,nfsv4:::op-putrootfh-done,
nfsv4:::op-setclientid-confirm-done,nfsv4:::op-remove-done,nfsv4:::op-rename-done,nfsv4:::op-commit-done,
nfsv4:::op-delegpurge-done,nfsv4:::op-setattr-done,nfsv4:::op-lockt-done
/self->t/
{printf("%Y %s %s %s took: %d ret: %d\n", walltimestamp, probename, args[0]->ci_remote, args[0]->ci_local,
(timestamp-self->t)/1000000000, args[2]->status);
self->t=0;}' \
-n 'nfsv4:::null-done/self->t/
{printf("%Y %s %s %s took: %d\n", walltimestamp, probename, args[0]->ci_remote, args[0]->ci_local,
(timestamp-self->t)/1000000000);
self->t=0;}' \
-o /var/tmp/nfs4_all_ops.log
# /var/tmp/ss &
The list of running jobs on the nfs server is:
These have been running without production traffic, so there is little going on there.
Still, the client needs to renew its NFSv4 lease as long as the filesystem is mounted.
First let's see how often the client issues RENEW operations and when it fails:
It had been issuing the RENEW ops every 60s, but then there is a gap of 114s (17:39:03 - 17:40:57), which is more than the 90s requested by the server, so the server replies with 10011 (NFS4ERR_EXPIRED). If an open() had happened then, it would have failed, as this client doesn't seem to handle NFS4ERR_EXPIRED returned by OPEN. Notice that it hits the issue for both server IP addresses at the same time (both served by the same NFS server), which is why there are two RENEW calls. It then starts working again and it starts sending new RENEW calls regularly. It must have issued setclientid to fix the state though; let's check it in the other log with all NFS operations:
Yes, it did issue setclientid (for both 10.210.6.5 and .6).
Let's see how often it issues the setclientid:
It issues the setclientid calls at roughly 30-minute or 1-hour intervals.
Let's see if it corresponds to RENEW called which got 10011 (NFS4ERR_EXPIRED):
It does. So every ~30 minutes or ~1hr the client doesn't issue RENEW on time and has to call SETCLIENTID. It seems that if an OPEN is issued during that window, it will get NFS4ERR_EXPIRED and fail, as Linux doesn't seem to issue RENEW and then retry the OPEN. I think that to hit the issue the file must not be already open and its attributes must not be cached.
Now the question is why Linux is sometimes late issuing RENEW.
I re-enabled the application and it failed, while stap (SystemTap) on the NFS client recorded:
From the debug log on the nfs client (for the 16:41:51 case):
On the nfs server:
Let's see renew operations:
And the client recovery from its debug log:
This is a gap of 236s (well over the 90s lease) between 16:38:20 and 16:42:16, and all open errors happened only during this exact time period:
nfs4renewd.c/nfs4_renew_state() prints "failed to call renewd. Reason: lease not expired", and later nfs4_schedule_state_renewal() prints dprintk("%s: requeueing work. Lease period = %ld\n", __func__, (timeout + HZ - 1) / HZ);
It prints a lease period of 57s, which suggests that 3s earlier an fsinfo() was run against the server. If Solaris doesn't renew the lease on fsinfo(), then RENEW won't be sent by nfs4_renew_state(), and the client ends up with a lease that has expired on the server and has to recover.
I used stap to confirm it.
nfs4proc.c /nfs4_do_fsinfo()
nfs4_do_open() -> _nfs4_do_open() -> nfs4_recover_expired_lease() -> nfs4_client_recover_expired_lease(), which checks test_bit(NFS4CLNT_LEASE_EXPIRED) up to 10 times in a loop; if the bit is not set, no lease recovery is triggered on this path and the open ends up failing with EIO.
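For reference, this is the gist of that recovery gate, paraphrased from memory of the upstream sources of that era (not a verbatim copy; details vary between kernel versions):

/* Paraphrased: lease recovery is keyed entirely off the NFS4CLNT_* bits. */
static int nfs4_client_recover_expired_lease(struct nfs_client *clp)
{
        unsigned int loop;
        int ret;

        for (loop = NFS4_MAX_LOOP_ON_RECOVER; loop != 0; loop--) {  /* 10 tries */
                ret = nfs4_wait_clnt_recover(clp);
                if (ret != 0)
                        break;
                if (!test_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state) &&
                    !test_bit(NFS4CLNT_CHECK_LEASE, &clp->cl_state))
                        break;          /* nobody flagged the lease, nothing to do */
                nfs4_schedule_state_manager(clp);
                ret = -EIO;
        }
        return ret;
}

So unless something sets NFS4CLNT_LEASE_EXPIRED (or NFS4CLNT_CHECK_LEASE), no SETCLIENTID-based recovery is started on this path.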
The only thing which will set the bit is nfs4renewd.c/nfs4_renew_state(), but if something else is modifying clp->cl_last_renewal and causing nfs4_renew_state() not to set the bit, then open() won't recover. We do see the dprintk("%s: requeueing work. Lease period = %ld\n", ...) printed by nfs4_schedule_state_renewal() with a value of 57 (and others), which means something updated cl_last_renewal 3s earlier.
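For context, here is the gist of the renewd scheduling logic, again a simplified paraphrase of fs/nfs/nfs4renewd.c from kernels of that era rather than a verbatim copy:

/* Paraphrased: the renewd worker only sends RENEW when it thinks the lease
 * is getting old, and it bases that purely on clp->cl_last_renewal. */
void nfs4_renew_state(struct work_struct *work)
{
        struct nfs_client *clp = container_of(work, struct nfs_client,
                                              cl_renewd.work);
        unsigned long lease = clp->cl_lease_time;
        unsigned long last  = clp->cl_last_renewal;

        if (time_after(jiffies, last + lease / 3)) {
                /* More than 1/3 of the lease passed since the last recorded
                 * renewal: queue an asynchronous RENEW (sched_state_renewal). */
        } else {
                dprintk("%s: failed to call renewd. Reason: lease not expired\n",
                        __func__);
        }
        nfs4_schedule_state_renewal(clp);
}

void nfs4_schedule_state_renewal(struct nfs_client *clp)
{
        /* Re-run the worker 2/3 of a lease after the last recorded renewal. */
        long timeout = (2 * clp->cl_lease_time) / 3
                       + (long)clp->cl_last_renewal - (long)jiffies;

        if (timeout < 5 * HZ)
                timeout = 5 * HZ;
        /* With a 90s lease this prints 60 right after a real RENEW; a value of
         * 57 means something bumped cl_last_renewal 3s earlier. */
        dprintk("%s: requeueing work. Lease period = %ld\n",
                __func__, (timeout + HZ - 1) / HZ);
        mod_delayed_work(system_wq, &clp->cl_renewd, timeout);
}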
Now nfs4_do_fsinfo() calls nfs4_set_lease_period().
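Roughly, the pre-fix code looked like this (paraphrased from the upstream sources of that era, not a verbatim copy; at the time nfs4_set_lease_period() also took the "last renewed" timestamp as an argument):

/* Paraphrased: a successful fsinfo (a GETATTR compound) was treated as if the
 * lease had just been renewed. */
static int nfs4_do_fsinfo(struct nfs_server *server, struct nfs_fh *fhandle,
                          struct nfs_fsinfo *fsinfo)
{
        struct nfs4_exception exception = { };
        unsigned long now = jiffies;
        int err;

        do {
                err = _nfs4_do_fsinfo(server, fhandle, fsinfo);
                if (err == 0) {
                        /* Updates clp->cl_lease_time and, crucially, sets
                         * clp->cl_last_renewal = now, so renewd thinks the
                         * lease was just renewed - even though the server
                         * never saw a RENEW and (on Solaris/Linux servers)
                         * did not renew the lease for this GETATTR. */
                        nfs4_set_lease_period(server->nfs_client,
                                              fsinfo->lease_time * HZ, now);
                        break;
                }
                err = nfs4_handle_exception(server, err, &exception);
        } while (exception.retry);
        return err;
}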
Many NFSv4 ops reset the local lease timer as well, on the assumption that a successful op makes the server renew the lease on its side too.
For example:
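(The helpers below are paraphrased from fs/nfs/nfs4proc.c of that era, not a verbatim copy.)

/* Paraphrased: many operation-completion paths call renew_lease() on success,
 * which pushes clp->cl_last_renewal forward on the client side. */
static void do_renew_lease(struct nfs_client *clp, unsigned long timestamp)
{
        spin_lock(&clp->cl_lock);
        if (time_before(clp->cl_last_renewal, timestamp))
                clp->cl_last_renewal = timestamp;
        spin_unlock(&clp->cl_lock);
}

static void renew_lease(const struct nfs_server *server, unsigned long timestamp)
{
        struct nfs_client *clp = server->nfs_client;

        if (!nfs4_has_session(clp))     /* NFSv4.0 - no sessions */
                do_renew_lease(clp, timestamp);
}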
Let's see what is calling nfs4_do_fsinfo():
Let's get some more info:
Notice that fsinfo is actually a set of GETATTR operations in NFSv4:
Let's double check that the nfs4_do_fsinfo() call is triggered by a vfs/fstat() call (or similar) issued by ls:
Yes, it is.
btw: just running ls is enough to trigger it once the data is no longer cached (or better, unmount it first):
So what happens here is that Linux will unmount cross-mountpoints after /proc/sys/fs/nfs/nfs_mountpoint_timeout, which on this server has the default value of 500. The next time anything triggers mounting a cross-mount, Linux will send an fsinfo() call (nfs4_do_fsinfo()).
For now, assuming nfs4_do_fsinfo() only gets called when mounting a cross-mount (not necessarily true), to test a possible workaround let's modify the Linux client so it won't unmount sub-mounts for 7 days:
Also running:
I've managed to reproduce the issue.
So now Linux is not going to send RENEW, as it assumes its lease got extended.
At 15:40:41+90s the Solaris server will assume the client's lease has expired, so if we call open() between 15:42:11 and 15:42:36, the open should fail:
and it did fail, while when re-run after Linux sent RENEW, it worked.
It did.
I've also managed to reproduce the same issue against Linux NFS server.
Then at 13:52:08, 2 seconds after the client's lease expired from the server's point of view (13:50:36 + 90s = 13:52:06), I ran:
Then, at 13:52:33 (13:51:36 + 57s), the client sent the RENEW operation, had to recover via SETCLIENTID, etc., and now open() works fine:
The Linux NFS server behaves the same as the Solaris NFS server.
The Linux server here is CentOS 7.4 running 3.10.0-693.11.6.el7.x86_64.
I also reproduced the issue against a Solaris 11.4.15.5.0 NFS server.
A workaround on Linux NFS client hosts is to increase the fs.nfs.nfs_mountpoint_timeout sysctl from its default of 500 to the maximum allowed value: 2147483.
I couldn't reproduce the issue against NetApp filers though; here is some clarification from NetApp support:
Now knowing more about how ONTAP behaves, I re-created the issue there as well, although the behavior is different.
It is easier to cause a do_fsinfo() call every 10s, as that will permanently stop the Linux client from sending the RENEW operation.
Notice that the open did not fail. Network dump:
So ONTAP responds with NFS4ERR_STALE_CLIENTID instead of NFS4ERR_EXPIRED.
The nfs4_do_open() function does end up calling nfs4_do_handle_exception(); the relevant code is:
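(The snippet below is paraphrased from memory of the upstream fs/nfs/nfs4proc.c of that era; heavily abbreviated, not a verbatim listing.)

switch (errorcode) {
        /* ... other cases ... */
        case -NFS4ERR_DELEG_REVOKED:
        case -NFS4ERR_ADMIN_REVOKED:
        case -NFS4ERR_EXPIRED:
        case -NFS4ERR_BAD_STATEID:
                /* Only per-stateid recovery is attempted here; a fresh OPEN
                 * has no stateid/state to recover, so the error simply
                 * propagates back up (simplified - the real code has more
                 * fallthrough handling). */
                if (inode != NULL && stateid != NULL) {
                        nfs_inode_find_state_and_recover(inode, stateid);
                        goto wait_on_recovery;
                }
                break;
        /* ... other cases ... */
        case -NFS4ERR_STALE_STATEID:
        case -NFS4ERR_STALE_CLIENTID:
                /* Full lease recovery (SETCLIENTID) is scheduled here, which
                 * is why the ONTAP case (NFS4ERR_STALE_CLIENTID) recovers and
                 * the open() does not fail. */
                nfs4_schedule_lease_recovery(clp);
                goto wait_on_recovery;
        /* ... other cases ... */
}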
Not sure why it is handled differently in nfs4_do_handle_exception().
This is where the Linux client first started assuming an implicit lease renewal when it calls fsinfo() (the GETATTRs for mounts). The code was later refactored a little bit, but the latest upstream version (at the time of debugging the issue) still did it.
The two patches below, which I contributed to the Linux kernel, are now also present in CentOS 8.3 with kernel 4.18.0-240.1.1.el8_3 and newer.