Robert Milkowski's blog: NFS4ERR_EXPIRED - Linux NFS client bug (fixed)

Wednesday, December 28, 2022

NFS4ERR_EXPIRED - Linux NFS client bug (fixed)

Another blog entry saved as draft for a long time and waiting to be published... better late than never, so let's do it now.
Over three years ago I found a bug in Linux nfs v4.0 client which results in applications temporarily not being able to open a file and getting an error. While this is due to a bug in the nfs v4.0 client code, depending on a specific nfs server you might or might not hit the issue. For example, both Linux and Solaris NFS servers will trigger the bug, while NetApp/ONTAP won't.

Once I debuged and understood the problem and was able to reproduce it. Then I ended up with a temporary workaround and later submitted two patches to Linux kernel which quickly have been integrated into Linux 5.6 and backported to some older versions (see the patches at the very end of this blog entry).

Let's look in more detail at the issue.

btw: some output has been abbreviated and/or modified to remove irrelevant details or to anonymize details.

An application written in java was sometimes failing to open a file with the below exception:

Caused by: java.io.FileNotFoundException: /.../latest.link (Input/output error) 
        at java.io.FileInputStream.open0(Native Method) 
        at java.io.FileInputStream.open(FileInputStream.java:195) 
        at java.io.FileInputStream.<init>(FileInputStream.java:138) 
        ...

Developers added an retry logic which helped. Usually after few retries the application was able to open the affected file fine. This was happening to different files, at different times on multiple servers. Even to files which were created long time ago (don't get fooled by the java.io.FileNotFoundException).

To understand how a file is accessed by the application (although some other applications were affected too), I wrote the below code based on the affected application.

$ cat Test.java
import java.io.*;
...

public class Test {
    public static void main(String[] args) {
        FileUtils fu = new FileUtils();
        File dir = new File("/mnt/.../indexweights");
        File ft = new File("/tmp/foo");
        try {
            ft = fu.resolveLinkFile(dir);
        } catch (IOException e) {
            System.out.println("IOException: " + e.getMessage());
            e.printStackTrace();
        } finally {
            System.out.println("Destination path: " + ft.getAbsolutePath());
        }
    }
}

Let's run it under strace and see what system calls it calls:

$ strace -tTfv -o /tmp/a -e trace=file,close,read java Test
Destination path: /mnt/...

$ grep -A2 latest.link /tmp/a
4422  15:33:38 open("/mnt/.../indexweights/latest.link", O_RDONLY) = 5 <0.069349>
4422  15:33:38 read(5, "...", 8192) = 11 <0.000010>
4422  15:33:38 close(5)                 = 0 <0.069149>

It is just a simple open() with O_RDONLY followed by read() and close(). Nothing suspicious. The error is happening with open() though, based on the java stack in the app log.
I did run the test program multiple times on affected servers and tried to access the affected files, but couldn't trigger the issue.

We modified one of the affected applications and added a retry logic for failed file opens, we also run part of the application under strace to see what's the actual error reported by open(). Let's see an example output:

$ egrep ' open\(|open resume' .../XXX.out.20190625220935.26281
...
[pid 133595] 22:10:17 open("/mnt/.../latest.link", O_RDONLY <unfinished ...>
[pid 133595] 22:10:17 <... open resumed> ) = -1 EIO (Input/output error) <0.069362>
...
[pid 133595] 22:10:21 open("/mnt/.../latest.link", O_RDONLY <unfinished ...>
[pid 133595] 22:10:21 <... open resumed> ) = -1 EIO (Input/output error) <0.069204>
...
[pid 133595] 22:10:29 open("/mnt/.../latest.link", O_RDONLY <unfinished ...>
[pid 133595] 22:10:29 <... open resumed> ) = 192 <0.487662>

The open() failed with EIO two times and then succeeded the third time.
Given these are nfs4 mount points with "hard" mount option we shouldn't be getting EIO though.
When the issue happens no errors reported in system logs.
I also collected network traffic for the above failures.

# tshark -r tcpdump.out.201906252205 -tad nfs
...
160040 2019-06-25 22:10:17 10.2.1.142 -> 10.2.6.11  NFS 242 V4 Call ACCESS FH: 0x18703b49, [Check: RD LU MD XT DL]
160044 2019-06-25 22:10:17  10.2.6.11 -> 10.2.1.142 NFS 222 V4 Reply (Call In 160040) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU]
160066 2019-06-25 22:10:17 10.2.1.142 -> 10.2.6.11  NFS 322 V4 Call OPEN DH: 0x18703b49/latest.link
160074 2019-06-25 22:10:17  10.2.6.11 -> 10.2.1.142 NFS 150 V4 Reply (Call In 160066) OPEN Status: NFS4ERR_EXPIRED
160793 2019-06-25 22:10:18 10.2.1.142 -> 10.2.6.11  NFS 322 V4 Call OPEN DH: 0x18703b49/latest.link
160799 2019-06-25 22:10:18  10.2.6.11 -> 10.2.1.142 NFS 150 V4 Reply (Call In 160793) OPEN Status: NFS4ERR_EXPIRED
161368 2019-06-25 22:10:19 10.2.1.142 -> 10.2.6.11  NFS 322 V4 Call OPEN DH: 0x18703b49/latest.link
161394 2019-06-25 22:10:19  10.2.6.11 -> 10.2.1.142 NFS 150 V4 Reply (Call In 161368) OPEN Status: NFS4ERR_EXPIRED
162036 2019-06-25 22:10:21 10.2.1.142 -> 10.2.6.11  NFS 234 V4 Call GETATTR FH: 0x7a1ffedc
162112 2019-06-25 22:10:21  10.2.6.11 -> 10.2.1.142 NFS 338 V4 Reply (Call In 162036) GETATTR
162114 2019-06-25 22:10:21 10.2.1.142 -> 10.2.6.11  NFS 322 V4 Call OPEN DH: 0x18703b49/latest.link
162194 2019-06-25 22:10:21  10.2.6.11 -> 10.2.1.142 NFS 150 V4 Reply (Call In 162114) OPEN Status: NFS4ERR_EXPIRED
162803 2019-06-25 22:10:22 10.2.1.142 -> 10.2.6.6   NFS 226 V4 Call RENEW CID: 0x8dbc
162811 2019-06-25 22:10:22   10.2.6.6 -> 10.2.1.142 NFS 182 V4 Reply (Call In 162803) RENEW Status: NFS4ERR_EXPIRED
162813 2019-06-25 22:10:22 10.2.1.142 -> 10.2.6.6   NFS 226 V4 Call RENEW CID: 0x8dbc
162825 2019-06-25 22:10:22   10.2.6.6 -> 10.2.1.142 NFS 182 V4 Reply (Call In 162813) RENEW Status: NFS4ERR_EXPIRED
162826 2019-06-25 22:10:22 10.2.1.142 -> 10.2.6.6   NFS 318 V4 Call SETCLIENTID
162850 2019-06-25 22:10:23   10.2.6.6 -> 10.2.1.142 NFS 198 V4 Reply (Call In 162826) SETCLIENTID
162851 2019-06-25 22:10:23 10.2.1.142 -> 10.2.6.6   NFS 234 V4 Call SETCLIENTID_CONFIRM
162858 2019-06-25 22:10:23   10.2.6.6 -> 10.2.1.142 NFS 182 V4 Reply (Call In 162851) SETCLIENTID_CONFIRM
163272 2019-06-25 22:10:24 10.2.1.142 -> 10.2.6.11  NFS 226 V4 Call RENEW CID: 0x0a7f
163302 2019-06-25 22:10:25  10.2.6.11 -> 10.2.1.142 NFS 182 V4 Reply (Call In 163272) RENEW Status: NFS4ERR_EXPIRED
163304 2019-06-25 22:10:25 10.2.1.142 -> 10.2.6.11  NFS 226 V4 Call RENEW CID: 0x0a7f
163332 2019-06-25 22:10:25  10.2.6.11 -> 10.2.1.142 NFS 182 V4 Reply (Call In 163304) RENEW Status: NFS4ERR_EXPIRED
163333 2019-06-25 22:10:25 10.2.1.142 -> 10.2.6.11  NFS 318 V4 Call SETCLIENTID
163343 2019-06-25 22:10:25  10.2.6.11 -> 10.2.1.142 NFS 198 V4 Reply (Call In 163333) SETCLIENTID
163344 2019-06-25 22:10:25 10.2.1.142 -> 10.2.6.11  NFS 234 V4 Call SETCLIENTID_CONFIRM
163382 2019-06-25 22:10:25  10.2.6.11 -> 10.2.1.142 NFS 182 V4 Reply (Call In 163344) SETCLIENTID_CONFIRM
164083 2019-06-25 22:10:27 10.2.1.142 -> 10.2.6.5   NFS 1606 V4 NULL Call
164109 2019-06-25 22:10:27   10.2.6.5 -> 10.2.1.142 NFS 302 V4 NULL Reply (Call In 164083)
164289 2019-06-25 22:10:28 10.2.1.142 -> 10.2.6.6   NFS 1606 V4 NULL Call
164301 2019-06-25 22:10:28   10.2.6.6 -> 10.2.1.142 NFS 302 V4 NULL Reply (Call In 164289)
164303 2019-06-25 22:10:28 10.2.1.142 -> 10.2.6.6   NFS 226 V4 Call RENEW CID: 0x9d10
164326 2019-06-25 22:10:28   10.2.6.6 -> 10.2.1.142 NFS 182 V4 Reply (Call In 164303) RENEW
164417 2019-06-25 22:10:28 10.2.1.142 -> 10.2.6.11  NFS 242 V4 Call ACCESS FH: 0x118824df, [Check: RD LU MD XT DL]
164427 2019-06-25 22:10:28  10.2.6.11 -> 10.2.1.142 NFS 222 V4 Reply (Call In 164417) ACCESS, [NOT Supported: MD XT DL], [Allowed: RD LU]
164429 2019-06-25 22:10:28 10.2.1.142 -> 10.2.6.11  NFS 234 V4 Call GETATTR FH: 0x7a1ffedc
164440 2019-06-25 22:10:28  10.2.6.11 -> 10.2.1.142 NFS 338 V4 Reply (Call In 164429) GETATTR
164441 2019-06-25 22:10:28 10.2.1.142 -> 10.2.6.11  NFS 322 V4 Call OPEN DH: 0x18703b49/latest.link
164458 2019-06-25 22:10:28  10.2.6.11 -> 10.2.1.142 NFS 466 V4 Reply (Call In 164441) OPEN StateID: 0xae70
164459 2019-06-25 22:10:28 10.2.1.142 -> 10.2.6.11  NFS 242 V4 Call OPEN_CONFIRM
164473 2019-06-25 22:10:29  10.2.6.11 -> 10.2.1.142 NFS 166 V4 Reply (Call In 164459) OPEN_CONFIRM
164474 2019-06-25 22:10:29 10.2.1.142 -> 10.2.6.11  NFS 250 V4 Call READ StateID: 0xa087 Offset: 0 Len: 11
164498 2019-06-25 22:10:29  10.2.6.11 -> 10.2.1.142 NFS 170 V4 Reply (Call In 164474) READ
164499 2019-06-25 22:10:29 10.2.1.142 -> 10.2.6.11  NFS 234 V4 Call GETATTR FH: 0xedefdae0
164519 2019-06-25 22:10:29  10.2.6.11 -> 10.2.1.142 NFS 338 V4 Reply (Call In 164499) GETATTR
164520 2019-06-25 22:10:29 10.2.1.142 -> 10.2.6.11  NFS 258 V4 Call CLOSE StateID: 0xa087
164545 2019-06-25 22:10:29  10.2.6.11 -> 10.2.1.142 NFS 230 V4 Reply (Call In 164520) CLOSE
...

Looks like nfs session is expiring and asynchronously renewing, failing in the meantime.

$ sudo /var/tmp/syscall-open-stat-filename-regex.stp -g --suppress-time-limits -o /var/tmp/failed_stats_and_opens.log 'XXX.*zip' &
[5] 21855

We caught some errors:

# grep '= -' /var/tmp/failed_stats_and_opens.log
Tue Nov 12 17:11:02 2019 java[16043] SyS_open("XXX") = -5 (EIO) : -5
Tue Nov 12 17:11:02 2019 java[16078] SyS_open("XXX") = -5 (EIO) : -5
...

Let's look at one java thread:

# grep 16043 /var/tmp/failed_stats_and_opens.log
Tue Nov 12 17:11:02 2019 java[16043] SyS_newlstat("/mnt/.../XXX.zip") = 0 : 0
Tue Nov 12 17:11:02 2019 java[16043] SyS_newstat("/mnt/.../XXX.zip") = 0 : 0
Tue Nov 12 17:11:02 2019 java[16043] SyS_newstat("/mnt/.../XXX.zip") = 0 : 0
Tue Nov 12 17:11:02 2019 java[16043] SyS_newstat("/mnt/.../XXX.zip") = 0 : 0
Tue Nov 12 17:11:02 2019 java[16043] SyS_open("/mnt/.../XXX.zip") = -5 (EIO) : -5

So within the same second it stat'ed the file multiple times just fine (1x lstat() + 3x stat()) only to get EIO on open().
Let's look at tcpdump capture:

# tcpdump -n -r /var/tmp/tcpdump.XXX.out
...
11:02.766890 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75637865:75638065, ack 2022930725, win 12288, options [nop,nop,TS val 12657161 ecr 997162631], length 200: NFS request xid 2489184212 196 getattr fh 0,0/22
11:02.767043 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022930725:2022931065, ack 75638065, win 32806, options [nop,nop,TS val 997162631 ecr 12657161], length 340: NFS reply xid 2489184212 reply ok 336 getattr NON 4 ids 0/15 sz 0
11:02.769335 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638065:75638329, ack 2022931065, win 12288, options [nop,nop,TS val 12657164 ecr 997162631], length 264: NFS request xid 2505961428 260 getattr fh 0,0/22
11:02.769477 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931065:2022931149, ack 75638329, win 32806, options [nop,nop,TS val 997162631 ecr 12657164], length 84: NFS reply xid 2505961428 reply ok 80 getattr ERROR: unk 10011
11:02.769544 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638329:75638593, ack 2022931149, win 12288, options [nop,nop,TS val 12657164 ecr 997162631], length 264: NFS request xid 2522738644 260 getattr fh 0,0/22
11:02.769638 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931149:2022931233, ack 75638593, win 32806, options [nop,nop,TS val 997162631 ecr 12657164], length 84: NFS reply xid 2522738644 reply ok 80 getattr ERROR: unk 10011
11:02.809052 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [.], ack 2022931233, win 12288, options [nop,nop,TS val 12657204 ecr 997162631], length 0
11:02.821772 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638593:75638789, ack 2022931233, win 12288, options [nop,nop,TS val 12657216 ecr 997162631], length 196: NFS request xid 2539515860 192 getattr fh 0,0/22
11:02.822228 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931233:2022931573, ack 75638789, win 32806, options [nop,nop,TS val 997162636 ecr 12657216], length 340: NFS reply xid 2539515860 reply ok 336 getattr NON 4 ids 0/15 sz 0
11:02.822261 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [.], ack 2022931573, win 12288, options [nop,nop,TS val 12657217 ecr 997162636], length 0
11:02.823327 IP 10.210.16.12.cisco-tdp > 10.210.6.6.nfs: Flags [P.], seq 75638789:75638957, ack 2022931573, win 12288, options [nop,nop,TS val 12657218 ecr 997162636], length 168: NFS request xid 2556293076 164 getattr fh 0,0/22
11:02.823436 IP 10.210.6.6.nfs > 10.210.16.12.cisco-tdp: Flags [P.], seq 2022931573:2022931857, ack 75638957, win 32806, options [nop,nop,TS val 997162636 ecr 12657218], length 284: NFS reply xid 2556293076 reply ok 280 getattr NON 2 ids 0/9 sz 0
...

It correlates with "getattr ERROR: unk 10011"
Let's also look at tcpdump capture for YYY:

# tcpdump -n -r /var/tmp/tcpdump.YYY.out
...
10:08.447238 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377189, win 11471, options [nop,nop,TS val 12602842 ecr 997157199], length 0
11:08.645057 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377189, win 11471, options [nop,nop,TS val 12663040 ecr 997157199], length 0
11:08.645147 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [.], ack 314027332, win 32806, options [nop,nop,TS val 997163219 ecr 12602842], length 0
11:08.645173 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027332:314027492, ack 174377189, win 11471, options [nop,nop,TS val 12663040 ecr 997163219], length 160: NFS request xid 398604749 156 getattr fh 0,28/0
11:08.645249 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [.], ack 314027492, win 32806, options [nop,nop,TS val 997163219 ecr 12663040], length 0
11:08.692687 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377189:174377305, ack 314027492, win 32806, options [nop,nop,TS val 997163223 ecr 12663040], length 116: NFS reply xid 398604749 reply ok 112 getattr ERROR: unk 24
11:08.692796 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027492:314027652, ack 174377305, win 11471, options [nop,nop,TS val 12663087 ecr 997163223], length 160: NFS request xid 415381965 156 getattr fh 0,28/0
11:08.693123 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377305:174377421, ack 314027652, win 32806, options [nop,nop,TS val 997163223 ecr 12663087], length 116: NFS reply xid 415381965 reply ok 112 getattr ERROR: unk 24
11:08.693164 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027652:314027916, ack 174377421, win 11471, options [nop,nop,TS val 12663088 ecr 997163223], length 264: NFS request xid 432159181 260 getattr fh 0,132/0
11:08.693592 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377421:174377553, ack 314027916, win 32806, options [nop,nop,TS val 997163223 ecr 12663088], length 132: NFS reply xid 432159181 reply ok 128 getattr ERROR: unk 40
11:08.693680 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314027916:314028084, ack 174377553, win 11471, options [nop,nop,TS val 12663088 ecr 997163223], length 168: NFS request xid 448936397 164 getattr fh 0,36/0
11:08.701988 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377553:174377669, ack 314028084, win 32806, options [nop,nop,TS val 997163224 ecr 12663088], length 116: NFS reply xid 448936397 reply ok 112 getattr ERROR: unk 24
11:08.702090 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [S], seq 1851114974, win 32804, options [mss 1460,sackOK,TS val 997163224 ecr 0,nop,wscale 5], length 0
11:08.702122 IP 10.210.16.12.33663 > 10.210.6.5.63021: Flags [S.], seq 2641182587, ack 1851114975, win 28960, options [mss 1460,sackOK,TS val 12663097 ecr 997163224,nop,wscale 8], length 0
11:08.702206 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [.], ack 1, win 32806, options [nop,nop,TS val 997163224 ecr 12663097], length 0
11:08.704470 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [P.], seq 1:77, ack 1, win 32806, options [nop,nop,TS val 997163225 ecr 12663097], length 76
11:08.704487 IP 10.210.16.12.33663 > 10.210.6.5.63021: Flags [.], ack 77, win 114, options [nop,nop,TS val 12663099 ecr 997163225], length 0
11:08.704534 IP 10.210.16.12.33663 > 10.210.6.5.63021: Flags [P.], seq 1:29, ack 77, win 114, options [nop,nop,TS val 12663099 ecr 997163225], length 28
11:08.704620 IP 10.210.6.5.63021 > 10.210.16.12.33663: Flags [.], ack 29, win 32806, options [nop,nop,TS val 997163225 ecr 12663099], length 0
11:08.741063 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377669, win 11471, options [nop,nop,TS val 12663136 ecr 997163224], length 0
11:13.717133 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [P.], seq 314028084:314028244, ack 174377669, win 11471, options [nop,nop,TS val 12668112 ecr 997163224], length 160: NFS request xid 465713613 156 getattr fh 0,28/0
11:13.717704 IP 10.210.6.5.nfs > 10.210.16.12.841: Flags [P.], seq 174377669:174377785, ack 314028244, win 32806, options [nop,nop,TS val 997163726 ecr 12668112], length 116: NFS reply xid 465713613 reply ok 112 getattr ERROR: unk 24
11:13.717715 IP 10.210.16.12.841 > 10.210.6.5.nfs: Flags [.], ack 174377785, win 11471, options [nop,nop,TS val 12668112 ecr 997163726], length 0
12:02.981295 IP 10.210.6.5.57869 > 10.210.16.12.33663: Flags [S], seq 2961022330, win 32804, options [mss 1460,sackOK,TS val 997168652 ecr 0,nop,wscale 5], length 0
...

Few seconds later we see "getattr ERROR: unk 24", and a corresponding error in application error log.
Nothing interesting in system logs on NFS server. However, on the nfs server I was also running:

# dtrace -q -n nfsv4:::op-getattr-done,nfsv4:::op-readdir-done,nfsv4:::op-access-done,
                        nfsv4:::op-link-done,nfsv4:::op-open-done,nfsv4:::op-create-done,
                        nfsv4:::op-delegreturn-done,nfsv4:::op-verify-done,nfsv4:::op-getfh-done,
                        nfsv4:::op-readlink-done,nfsv4:::op-renew-done,nfsv4:::op-lookup-done,
                        nfsv4:::op-remove-done,nfsv4:::op-rename-done'
                        /args[2]->status != 0/
                        {
                          printf("%Y %s %s %d:%d %s %d\n", walltimestamp, probename, args[0]->ci_remote,
                                 args[1]->noi_cred->cr_uid, args[1]->noi_cred->cr_gid,
                                 args[1]->noi_curpath, args[2]->status);}'
                  -o /var/tmp/nfs-ops-with-errno.log

Let's see log around Nov 12 17:11:02:

...
Nov 12 17:10:42 op-lookup-done 10.250.3.29 10006:5000 /.../XXX 2
Nov 12 17:10:41 op-renew-done 10.210.13.49 60001:60001 <unknown> 10022
Nov 12 17:10:46 op-renew-done 10.210.13.49 60001:60001 <unknown> 10022
Nov 12 17:10:46 op-lookup-done 10.250.3.29 10006:5000 <unknown> 2
Nov 12 17:10:49 op-lookup-done 10.210.10.152 10006:5000 <unknown> 2
Nov 12 17:10:50 op-lookup-done 10.250.3.29 10006:5000 /.../XXX 2
...
Nov 12 17:10:55 op-lookup-done 10.250.3.29 10006:5000 <unknown> 2
Nov 12 17:10:56 op-renew-done 10.210.13.49 60001:60001 <unknown> 10022
Nov 12 17:10:58 op-renew-done 10.210.136.15 60001:60001 <unknown> 10011
Nov 12 17:10:59 op-renew-done 10.210.136.15 60001:60001 <unknown> 10011
Nov 12 17:10:59 op-renew-done 10.210.136.15 60001:60001 <unknown> 10011
Nov 12 17:10:58 op-renew-done 10.210.136.15 60001:60001 <unknown> 10011
...

Notice the op-renew-done errors.
Let's see what these are:

# ggrep -E '10011|10022' /usr/include/nfs/nfs4_kprot.h
        NFS4ERR_EXPIRED = 10011,
        NFS4ERR_STALE_CLIENTID = 10022,

The op-renew-done failure with 10011 (NFS4ERR_EXPIRED) matches the time when the application failed.
The other interesting thing is, that the nfs linux client has gssd running with -t 1800 so it needs to renew creds every 30 minutes.

# ggrep 'Nov 12' /var/tmp/nfs-ops-with-errno.log | ggrep 10.210.16.12  | ggrep op-renew
Nov 12 15:10:45 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 15:10:45 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 15:40:57 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 15:40:56 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:04:58 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:04:58 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:10:58 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:10:58 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:40:29 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:40:29 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 17:11:08 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 17:11:08 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
...

The interesting entries here are:

Nov 12 16:40:29 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 16:40:29 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 17:11:08 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011
Nov 12 17:11:08 op-renew-done 10.210.16.12 60001:60001 <unknown> 10011

This seems to be 39s later than 30 minutes between them. Although I didn't catch the successful op-renew-done here.

After some further debugging let's look at the original network dump again.
It calls reaaddir, access, .... to get to final directory and then tries to open file, server responds with NFS4ERR_EXPIRED, client tries again, gets the error again and it seems to give up and move on to other files where the story repeats for each one of them. All directory lookups, readdirs, etc. work fine though. Here is just the open, it's retry and errors for the first file that failed:

102126 17:11:02.769335 10.210.16.12 10.210.6.6 NFS 330 V4 Call (Reply In 102127) PUTFH | OPEN DH: 0xcbbed605/XXX.zip | GETFH | ACCESS FH: 0x00000000, [Check: RD MD XT XE] | GETATTR
102127 17:11:02.769477 10.210.6.6 10.210.16.12 NFS 150 V4 Reply (Call In 102126) PUTFH | OPEN Status: NFS4ERR_EXPIRED
102128 17:11:02.769544 10.210.16.12 10.210.6.6 NFS 330 V4 Call (Reply In 102129) PUTFH | OPEN DH: 0xcbbed605/XXX.zip | GETFH | ACCESS FH: 0x00000000, [Check: RD MD XT XE] | GETATTR
102129 17:11:02.769638 10.210.6.6 10.210.16.12 NFS 150 V4 Reply (Call In 102128) PUTFH | OPEN Status: NFS4ERR_EXPIRED

After multiple open() calls failed it, the nfs client eventually does:

102831 17:12:02.917119 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 102832) RENEW CID: 0xab20
102832 17:12:02.964745 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 102831) RENEW Status: NFS4ERR_EXPIRED
102834 17:12:02.967293 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 102835) RENEW CID: 0xab20
102835 17:12:02.967972 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 102834) RENEW Status: NFS4ERR_EXPIRED
102836 17:12:02.968021 10.210.16.12 10.210.6.6 NFS 330 V4 Call (Reply In 102837) SETCLIENTID
102837 17:12:02.968332 10.210.6.6 10.210.16.12 NFS 198 V4 Reply (Call In 102836) SETCLIENTID
102838 17:12:02.968365 10.210.16.12 10.210.6.6 NFS 234 V4 Call (Reply In 102839) SETCLIENTID_CONFIRM
102839 17:12:02.981174 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 102838) SETCLIENTID_CONFIRM

which fixes the issue and a moment later it can open files fine:

102874 17:12:04.838647 10.210.16.12 10.210.6.6 NFS 330 V4 Call (Reply In 102875) PUTFH | OPEN DH: 0x6030bc46/XXX.zip | GETFH | ACCESS FH: 0x00000000, [Check: RD MD XT XE] | GETATTR
102875 17:12:04.838844 10.210.6.6 10.210.16.12 NFS 478 V4 Reply (Call In 102874) PUTFH | OPEN StateID: 0x9aaf | GETFH | ACCESS, [Access Denied: XE], [Allowed: RD MD XT] | GETATTR

The last time the client issues RENEW operation before the first failure was:

101689 17:06:16.805156 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 101691) RENEW CID: 0xab20
101691 17:06:16.838027 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 101689) RENEW
101694 17:07:16.965131 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 101697) RENEW CID: 0xab20
101697 17:07:16.997082 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 101694) RENEW
101701 17:08:17.125137 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 101703) RENEW CID: 0xab20
101703 17:08:17.157783 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 101701) RENEW
101706 17:09:17.285124 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 101709) RENEW CID: 0xab20
101709 17:09:17.320893 10.210.6.6 10.210.16.12 NFS 182 V4 Reply (Call In 101706) RENEW
... no more RENEW operations until:
102831 17:12:02.917119 10.210.16.12 10.210.6.6 NFS 226 V4 Call (Reply In 102832) RENEW CID: 0xab20

Notice that it's been issuing RENEW every 60s (fileserver sets the grace period to 90s), and then it hasn't done so for almost 166s which is more than the 90s session timeout requested by server. For sanity let's check when the first NFS4_EXPIRED error happened. The last RENEW before the issue was at 17:09:17 so unless the sessions was renewed (it wasn't) the server should start responding with errors from about 17:10:47. The first NFS4ERR_EXPIRED is recorded at 17:11:02 which is the first OPEN call happening during the time frame.
Looks like a bug in the nfs client.

Let's run some additional debugging.
Network packet capture:

# tcpdump -i aggr1 -w /var/tmp/tcpdump.20191119.out host 10.210.16.12  &

Let's trace all checks for nfsv4 lease expire and record all cases where it did expire, this is traced on the nfs server:

# dtrace -q -n fbt::rfs4_lease_expired:entry' {self->cp=args[0];}' 
                     -n fbt::rfs4_lease_expired:return'
                        /arg1 && self->cp/
                        {this->s=(struct sockaddr_in *)&self->cp->rc_addr;
                         printf("%Y %s client_ip: %s client_id: %x forced_expire: %d last_access: %u dt: %d rfs4_lease_time: %d ret: %d\n",
                                  walltimestamp, probefunc, inet_ntop(this->s->sin_family, (void *)&this->s->sin_addr), 
                                  self->cp->rc_clientid, self->cp->rc_forced_expire, self->cp->rc_last_access, 
                                  (walltimestamp/1000000000)-self->cp->rc_last_access, `rfs4_lease_time, arg1);
                         self->cp=0;stack()}'  
                     -n fbt::rfs4_lease_expired:return'/self->cp/{self->cp=0;}' -o /var/tmp/rfs4_lease_expired_failed.out &

Let's record all nfs RENEW calls:

# dtrace -q -n nfsv4:::op-renew-start'
                     {self->t=timestamp; 
                      printf("%Y %s client_id: 0x%x noi_xid: %u %s %d:%d %s\n", walltimestamp, probename, args[2]->clientid,
                             args[1]->noi_xid, args[0]->ci_remote, args[1]->noi_cred->cr_uid, args[1]->noi_cred->cr_gid, args[1]->noi_curpath);}' 
                     -n nfsv4:::op-renew-done'
                        /self->t/
                        {printf("%Y %s noi_xid: %u %s %s %d:%d %s ret: %d took: %d seconds\n", walltimestamp, probename,
                                args[1]->noi_xid, args[0]->ci_remote, args[0]->ci_local, args[1]->noi_cred->cr_uid,
                                args[1]->noi_cred->cr_gid, args[1]->noi_curpath, args[2]->status,
                                (timestamp-self->t)/1000000000);
                         self->t=0;}'
                     -o /var/tmp/nfs4-renew.log &

Let's catch all nfsv4 operations and their statuses on nfs server:

# cat /var/tmp/ss
#!/bin/bash

dtrace -q -n nfsv4:::op-open-downgrade-start,nfsv4:::op-secinfo-start,nfsv4:::op-readdir-start,nfsv4:::op-setclientid-start,
             nfsv4:::op-remove-start,nfsv4:::cb-recall-start,nfsv4:::op-rename-start,nfsv4:::op-commit-start,nfsv4:::op-savefh-start,
             nfsv4:::op-delegpurge-start,nfsv4:::op-openattr-start,nfsv4:::op-delegreturn-start,nfsv4:::op-putpubfh-start,
             nfsv4:::op-nverify-start,nfsv4:::op-verify-start,nfsv4:::op-setattr-start,nfsv4:::op-getattr-start,
             nfsv4:::op-open-confirm-start,nfsv4:::null-start,nfsv4:::op-readlink-start,nfsv4:::op-lockt-start,nfsv4:::op-locku-start,
             nfsv4:::op-putrootfh-start,nfsv4:::op-close-start,nfsv4:::op-access-start,nfsv4:::op-getfh-start,nfsv4:::op-read-start,
             nfsv4:::op-lock-start,nfsv4:::op-open-start,nfsv4:::op-link-start,nfsv4:::op-renew-start,nfsv4:::op-write-start,
             nfsv4:::op-restorefh-start,nfsv4:::op-lookupp-start,nfsv4:::op-release-lockowner-start,nfsv4:::compound-start,
             nfsv4:::op-putfh-start,nfsv4:::op-setclientid-confirm-start,nfsv4:::op-lookup-start,nfsv4:::op-create-start'
             /args[0]->ci_remote=="10.210.16.12"/
             {self->t=timestamp;
              printf("%Y %s %s %s\n", walltimestamp, probename, args[0]->ci_remote, args[0]->ci_local);}' \
              
          -n nfsv4:::op-open-downgrade-done,nfsv4:::op-readdir-done,nfsv4:::op-access-done,nfsv4:::op-lock-done,nfsv4:::op-link-done,
             nfsv4:::op-setclientid-done,nfsv4:::op-open-confirm-done,nfsv4:::op-open-done,nfsv4:::op-create-done,
             nfsv4:::op-secinfo-done,nfsv4:::op-locku-done,nfsv4:::op-close-done,nfsv4:::op-write-done,
             nfsv4:::op-delegreturn-done,nfsv4:::op-getattr-done,nfsv4:::op-verify-done,nfsv4:::op-getfh-done,
             nfsv4:::op-putfh-done,nfsv4:::op-nverify-done,nfsv4:::op-readlink-done,nfsv4:::op-renew-done,
             nfsv4:::op-savefh-done,nfsv4:::op-read-done,nfsv4:::compound-done,nfsv4:::cb-recall-done,
             nfsv4:::op-lookup-done,nfsv4:::op-restorefh-done,nfsv4:::op-release-lockowner-done,nfsv4:::op-lookupp-done,
             nfsv4:::op-openattr-done,nfsv4:::op-putpubfh-done,nfsv4:::op-putrootfh-done,
             nfsv4:::op-setclientid-confirm-done,nfsv4:::op-remove-done,nfsv4:::op-rename-done,nfsv4:::op-commit-done,
             nfsv4:::op-delegpurge-done,nfsv4:::op-setattr-done,nfsv4:::op-lockt-done'
             /self->t/
             {printf("%Y %s %s %s took: %d ret: %d\n", walltimestamp, probename, args[0]->ci_remote, args[0]->ci_local,
                     (timestamp-self->t)/1000000000, args[2]->status);
              self->t=0;}' \
              
    -n nfsv4:::null-done'/self->t/
            {printf("%Y %s %s %s took: %d\n", walltimestamp, probename, args[0]->ci_remote, args[0]->ci_local,
                    (timestamp-self->t)/1000000000);
             self->t=0;}' \
  -o /var/tmp/nfs4_all_ops.log

# /var/tmp/ss &

The list of running jobs on the nfs server is:

# jobs
[2]   Running                 tcpdump -i aggr1 -w /var/tmp/tcpdump.20191119.out host 10.210.16.12 &
[3]   Running                 /var/tmp/ss &
[4]-  Running                 dtrace -q -n nfsv4:::op-renew-start'{self->t=timestamp; printf("%Y %s client_id: 0x%x noi_xid: %u %s %d:%d %s\n", walltimestamp, probename, args[2]->clientid, args[1]->noi_xid, args[0]->ci_remote, args[1]->noi_cred->cr_uid, args[1]->noi_cred->cr_gid, args[1]->noi_curpath);}' -n nfsv4:::op-renew-done'/self->t/{printf("%Y %s noi_xid: %u %s %s %d:%d %s ret: %d took: %d seconds\n", walltimestamp, probename, args[1]->noi_xid, args[0]->ci_remote, args[0]->ci_local, args[1]->noi_cred->cr_uid, args[1]->noi_cred->cr_gid, args[1]->noi_curpath, args[2]->status, (timestamp-self->t)/1000000000);self->t=0;}' -o /var/tmp/nfs4-renew.log &  (wd: /var/log/smblog)
[5]+  Running                 dtrace -q -n fbt::rfs4_lease_expired:entry'{self->cp=args[0];}' -n fbt::rfs4_lease_expired:return'/arg1 && self->cp/{this->s=(struct sockaddr_in *)&self->cp->rc_addr; printf("%Y %s client_ip: %s client_id: %x forced_expire: %d last_access: %u dt: %d rfs4_lease_time: %d ret: %d\n", walltimestamp, probefunc, inet_ntop(this->s->sin_family, (void *)&this->s->sin_addr), self->cp->rc_clientid, self->cp->rc_forced_expire, self->cp->rc_last_access, (walltimestamp/1000000000)-self->cp->rc_last_access, `rfs4_lease_time, arg1);self->cp=0;stack()}' -n fbt::rfs4_lease_expired:return'/self->cp/{self->cp=0;}' -o /var/tmp/rfs4_lease_expired_failed.out &

These have been running without production traffic, so there is little going on there.
Still client needs to renew its nfsv4 session as long as the filesystem is mounted, etc.
First let's see how often the client issues RENEW operations and when it fails:

# grep "op-renew-done.*10.210.16.12" /var/tmp/nfs4-renew.log | less
...
Nov 19 17:30:01 op-renew-done noi_xid: 807596244 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:31:01 op-renew-done noi_xid: 824373460 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:31:01 op-renew-done noi_xid: 2357017293 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:32:02 op-renew-done noi_xid: 841150676 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:32:02 op-renew-done noi_xid: 2373794509 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:33:02 op-renew-done noi_xid: 857927892 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:33:02 op-renew-done noi_xid: 2390571725 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:34:02 op-renew-done noi_xid: 2407348941 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:34:02 op-renew-done noi_xid: 874705108 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:35:02 op-renew-done noi_xid: 2424126157 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:35:02 op-renew-done noi_xid: 891482324 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:36:02 op-renew-done noi_xid: 2440903373 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:36:02 op-renew-done noi_xid: 908259540 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:37:02 op-renew-done noi_xid: 925036756 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:37:02 op-renew-done noi_xid: 2457680589 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:38:02 op-renew-done noi_xid: 2474457805 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:38:02 op-renew-done noi_xid: 941813972 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:39:03 op-renew-done noi_xid: 958591188 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:39:03 op-renew-done noi_xid: 2491235021 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 796801741 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 813578957 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 3491950804 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 3508728020 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:41:02 op-renew-done noi_xid: 863910605 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 19 17:41:02 op-renew-done noi_xid: 3559059668 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
...

It's been issuing the RENEW ops at every 60s but then there is a gap of 114s (17:39:03 - 17:40:57) which is more than 90s requested by the server, so the server replies with 10011 (NFS4ERR_EXPIRED). If now an open() happened it would have failed as this client doesn't seem to handle NFS4ERR_RENEW returned by open. Notice that it hits the issue for both server IP addresses at the same time (both served by the same nfs server), that's why there are two RENEW calls. It then starts working again and it starts sending new RENEW calls regularly. It must have issued setclientid to fix the state though, let's check it in the other log with all nfs operations:

# less /var/tmp/nfs4_all_ops.log
...
Nov 19 17:38:02 op-renew-start 10.210.16.12 10.210.6.6
Nov 19 17:38:02 op-renew-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 17:39:03 compound-start 10.210.16.12 10.210.6.6
Nov 19 17:39:03 op-renew-start 10.210.16.12 10.210.6.6
Nov 19 17:39:03 op-renew-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 17:39:03 compound-start 10.210.16.12 10.210.6.5
Nov 19 17:39:03 op-renew-start 10.210.16.12 10.210.6.5
Nov 19 17:39:03 op-renew-done 10.210.16.12 10.210.6.5 took: 0 ret: 0

... many getattr, access, etc. operations

Nov 19 17:40:57 op-renew-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-renew-done 10.210.16.12 10.210.6.5 took: 0 ret: 10011
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-renew-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-renew-done 10.210.16.12 10.210.6.5 took: 0 ret: 10011
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-setclientid-confirm-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-setclientid-confirm-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-setclientid-confirm-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-renew-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-renew-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-renew-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-renew-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-setclientid-start 10.210.16.12 10.210.6.6
Nov 19 17:40:57 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 17:40:57 op-setclientid-confirm-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 17:40:57 compound-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-setclientid-start 10.210.16.12 10.210.6.5
Nov 19 17:40:57 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 17:41:02 compound-start 10.210.16.12 10.210.6.5
Nov 19 17:41:02 op-renew-start 10.210.16.12 10.210.6.5
Nov 19 17:41:02 op-renew-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 17:41:02 compound-start 10.210.16.12 10.210.6.6
Nov 19 17:41:02 op-renew-start 10.210.16.12 10.210.6.6
Nov 19 17:41:02 op-renew-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 17:42:03 compound-start 10.210.16.12 10.210.6.5
Nov 19 17:42:03 op-renew-start 10.210.16.12 10.210.6.5
Nov 19 17:42:03 op-renew-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
...

Yes, it did issue setclientid (for both 10.210.6.56 and .6).
Let's see how often it issues the setclientid:

# ggrep op-setclientid-done /var/tmp/nfs4_all_ops.log
Nov 19 13:11:00 op-setclientid-done 10.210.16.12 took: 0 ret: 0
Nov 19 13:40:55 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 14:40:56 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 15:11:05 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 15:40:54 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 15:40:54 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 16:40:48 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 16:40:48 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 17:40:57 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 17:40:57 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 18:40:55 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 18:40:55 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 19:40:57 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 19:40:57 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 20:11:03 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 20:11:03 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 20:40:51 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 20:40:51 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 21:10:49 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 21:10:49 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 22:40:59 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 22:40:59 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 23:11:08 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 23:11:08 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 19 23:40:58 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 19 23:40:58 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 00:41:06 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 00:41:06 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 01:11:02 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 01:11:02 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 02:10:54 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 02:10:54 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 03:10:59 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 03:10:59 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 03:41:06 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 03:41:06 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 04:11:15 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 04:11:15 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 04:41:00 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 04:41:00 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 05:11:03 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 05:11:03 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 05:40:59 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 05:40:59 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 06:11:09 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 06:11:09 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 06:41:05 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 06:41:05 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 07:11:05 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 07:11:05 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 07:40:58 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 07:40:58 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 08:10:49 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 08:10:49 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 10:41:18 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 10:41:18 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 11:11:23 op-setclientid-done 10.210.16.12 10.210.6.5 took: 0 ret: 0
Nov 20 11:11:23 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0

It issues the setclientid calls either every 1hr or at about every 30 minutes intervals.
Let's see if it corresponds to RENEW called which got 10011 (NFS4ERR_EXPIRED):

# grep "op-renew-done.*10.210.16.12.*10011" /var/tmp/nfs4-renew.log
...
Nov 19 13:11:00 op-renew-done noi_xid: 3380034253 10.210.16.12 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 13:11:00 op-renew-done noi_xid: 3363257037 10.210.16.12 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 13:40:55 op-renew-done noi_xid: 2205694669 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 13:40:55 op-renew-done noi_xid: 2222471885 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 14:40:56 op-renew-done noi_xid: 3866704589 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 14:40:56 op-renew-done noi_xid: 3883481805 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 15:11:05 op-renew-done noi_xid: 2709142221 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 15:11:05 op-renew-done noi_xid: 2725919437 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 15:40:54 op-renew-done noi_xid: 1232812749 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 15:40:54 op-renew-done noi_xid: 1249589965 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 15:40:54 op-renew-done noi_xid: 4196397268 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 15:40:54 op-renew-done noi_xid: 4213174484 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 16:40:48 op-renew-done noi_xid: 1713500372 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 16:40:48 op-renew-done noi_xid: 1696723156 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 16:40:48 op-renew-done noi_xid: 3162258125 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 16:40:48 op-renew-done noi_xid: 3179035341 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 796801741 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 813578957 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 3491950804 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 17:40:57 op-renew-done noi_xid: 3508728020 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 18:40:55 op-renew-done noi_xid: 2407480013 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 18:40:55 op-renew-done noi_xid: 2424257229 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 18:40:55 op-renew-done noi_xid: 690286804 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 18:40:55 op-renew-done noi_xid: 673509588 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 19:40:57 op-renew-done noi_xid: 4018158285 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 19:40:57 op-renew-done noi_xid: 2149970132 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 19:40:57 op-renew-done noi_xid: 2166747348 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 19 19:40:57 op-renew-done noi_xid: 4034935501 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
...

It does. So every ~30 minutes or ~1hr the client doesn't issuew RENEW ontime and has to call SETCLIENTID. It seems that if during the window there is an OPEN issued it will get NFS4ERR_EXPIRED and fail as Linux doesn't seem to issue RENEW and then retry the OPEN. I think the file can't be already opened and attributes can't be cached to hit the issue.
Now the question is why Linux is late here to issue RENEW sometimes.

I re-enabled application and it failed, while stap on the nfs client recorded:

# less /var/tmp/failed_stats_and_opens.log
...
Wed Nov 20 16:41:16 2019 java[10562] uid: 10203 SyS_newlstat("/mnt/.../XXX.zip") = 0 : 0
Wed Nov 20 16:41:16 2019 java[10562] uid: 10203 SyS_newstat("/mnt/.../XXX.zip") = 0 : 0
Wed Nov 20 16:41:16 2019 java[10562] uid: 10203 SyS_newstat("/mnt/.../XXX.zip") = 0 : 0
Wed Nov 20 16:41:16 2019 java[10562] uid: 10203 SyS_newstat("/mnt/.../XXX.zip") = 0 : 0
Wed Nov 20 16:41:16 2019 java[10562] uid: 10203 SyS_open("/mnt/.../.zip") = -5 (EIO) : -5
...

From the debug log on the nfs client (for the 16:41:51 case):

Nov 20 16:41:51 xxx kernel: [702405.312031] NFS: open file(.../XXX.zip)
Nov 20 16:41:51 xxx kernel: [702405.312033] --> nfs_put_client({23})
Nov 20 16:41:51 xxx kernel: [702405.312051] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=1024
Nov 20 16:41:51 xxx kernel: [702405.312053] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Nov 20 16:41:51 xxx kernel: [702405.312212] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Nov 20 16:41:51 xxx kernel: [702405.312227] nfs4_map_errors could not handle NFSv4 error 10011
Nov 20 16:41:51 xxx kernel: [702405.312233] NFS: atomic_open(0:109/65275), XXX.zip
Nov 20 16:41:51 xxx kernel: [702405.312268] --> nfs_put_client({23})
Nov 20 16:41:51 xxx kernel: [702405.312297] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=1024
Nov 20 16:41:51 xxx kernel: [702405.312299] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Nov 20 16:41:51 xxx kernel: [702405.312468] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Nov 20 16:41:51 xxx kernel: [702405.312478] nfs4_map_errors could not handle NFSv4 error 10011

On the nfs server:

# grep '20 16:[34].*-done.*10.210.6.6' /var/tmp/nfs4_all_ops.log | less
...
2019 Nov 20 16:41:51 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
...
2019 Nov 20 16:41:51 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
...

Let's see renew operations:

# ggrep "op-renew-done.*10.210.16.12 10.210.6.6" /var/tmp/nfs4-renew.log | ggrep '20 16:[234]'
Nov 20 16:20:46 op-renew-done noi_xid: 2640048340 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:21:46 op-renew-done noi_xid: 2656825556 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:22:46 op-renew-done noi_xid: 2673602772 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:23:46 op-renew-done noi_xid: 2690379988 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:26:18 op-renew-done noi_xid: 3479630036 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:27:19 op-renew-done noi_xid: 3496407252 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:28:19 op-renew-done noi_xid: 3513184468 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:29:19 op-renew-done noi_xid: 3529961684 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:30:19 op-renew-done noi_xid: 3546738900 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:31:19 op-renew-done noi_xid: 3563516116 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:32:19 op-renew-done noi_xid: 3580293332 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:33:19 op-renew-done noi_xid: 3597070548 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:34:20 op-renew-done noi_xid: 3613847764 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:35:20 op-renew-done noi_xid: 3630624980 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:36:20 op-renew-done noi_xid: 3647402196 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:37:20 op-renew-done noi_xid: 3664179412 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:38:20 op-renew-done noi_xid: 3680956628 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:42:16 op-renew-done noi_xid: 3932811476 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 20 16:42:16 op-renew-done noi_xid: 3949588692 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 10011 took: 0 seconds
Nov 20 16:44:46 op-renew-done noi_xid: 1438445780 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:45:46 op-renew-done noi_xid: 1455222996 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:46:46 op-renew-done noi_xid: 1472000212 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:47:46 op-renew-done noi_xid: 1488777428 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:48:46 op-renew-done noi_xid: 1505554644 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds
Nov 20 16:49:46 op-renew-done noi_xid: 1522331860 10.210.16.12 10.210.6.6 60001:60001 <unknown> ret: 0 took: 0 seconds

# grep '20 1[56]:.*setclie.*10.210.6.6' /var/tmp/nfs4_all_ops.log
Nov 20 15:11:09 op-setclientid-start 10.210.16.12 10.210.6.6
Nov 20 15:11:09 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 15:11:09 op-setclientid-confirm-start 10.210.16.12 10.210.6.6
Nov 20 15:11:09 op-setclientid-confirm-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 15:40:49 op-setclientid-start 10.210.16.12 10.210.6.6
Nov 20 15:40:49 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 15:40:49 op-setclientid-confirm-start 10.210.16.12 10.210.6.6
Nov 20 15:40:49 op-setclientid-confirm-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 16:42:16 op-setclientid-start 10.210.16.12 10.210.6.6
Nov 20 16:42:16 op-setclientid-done 10.210.16.12 10.210.6.6 took: 0 ret: 0
Nov 20 16:42:16 op-setclientid-confirm-start 10.210.16.12 10.210.6.6
Nov 20 16:42:16 op-setclientid-confirm-done 10.210.16.12 10.210.6.6 took: 0 ret: 0

And the client recovery from its debug log:

Nov 20 16:42:16 xxx kernel: [702430.126654] nfs4_renew_state: start
Nov 20 16:42:16 xxx kernel: [702430.126677] nfs4_renew_state: done
Nov 20 16:42:16 xxx kernel: [702430.245186] nfs4_schedule_lease_recovery: scheduling lease recovery for server xxx
Nov 20 16:42:16 xxx kernel: [702430.247174] nfs4_schedule_state_renewal: requeueing work. Lease period = 30
Nov 20 16:42:16 xxx kernel: [702430.247177] --> nfs_put_client({24})
Nov 20 16:42:16 xxx kernel: [702430.247817] nfs4_recovery_handle_error: handled error -10011 for server xxx
Nov 20 16:42:16 xxx kernel: [702430.247821] NFS call  setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 xxx/10.210.6.6'
Nov 20 16:42:16 xxx kernel: [702430.248233] NFS reply setclientid: 0
Nov 20 16:42:16 xxx kernel: [702430.248235] NFS call  setclientid_confirm auth=RPCSEC_GSS, (client ID ad23bf2032c90f)
Nov 20 16:42:16 xxx kernel: [702430.257473] NFS reply setclientid_confirm: 0

This is a gap of 236s (well over the 90s) between 16:38:20 and 16:42:16 and all open errors happened only during this exact time period:

# grep '20 1[56]:.*open-done.*10.210.6.6' /var/tmp/nfs4_all_ops.log | grep 10011
Nov 20 16:40:36 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:40:36 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:15 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:15 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:16 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:16 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:17 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:16 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:17 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:16 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:16 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:16 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:17 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:17 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:22 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:22 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:22 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:22 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:30 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:30 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:32 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:31 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011
Nov 20 16:41:32 op-open-done 10.210.16.12 10.210.6.6 took: 0 ret: 10011

nfs4renewd.c /nfs4_renew_state() prints “failed to call renewd. Reason: lease not expired” and later dprintk("%s: requeueing work. Lease period = %ld\n", func, (timeout + HZ - 1) / HZ);
Prints 57s, this suggests 3s before fsinfo() was run on the server, if solaris wouldn’t renew lease on fsinfo() then RENEW wouldn’t be send by nfs4_renew_state() and the client would end up without a lease and would have to renew.
I used stap to confirm it.

nfs4proc.c /nfs4_do_fsinfo() nfs4_do_open() -> _nfs4_do_open() -> nfs4_recover_expired_lease() -> nfs4_client_recover_expired_lease() -> test_bit(NFS4CLNT_LEASE_EXPIRED) 10 times in a loop, if not set return EIO

The only thing which will set the bit is nfs4renewd.c/nfs4_renew_state() but if something else is modifying clp->cl_last_renewal and casing nfs4_renew_state() not to set the bit, then open() won’t recover. We do see the dprintk("%s: requeueing work. Lease period = %ld\n"… printed by nfs4_schedule_state_renewal() with value of 57 (and others) which means 3s before something updated cl_last_renewal.
Now nfs4_do_fsinfo() calls nfs4_set_lease_period().

static struct nfs4_state *nfs4_do_open(struct inode *dir,
                                        struct nfs_open_context *ctx,
                                        int flags,
                                        struct iattr *sattr,
                                        struct nfs4_label *label,
                                        int *opened)
{
        struct nfs_server *server = NFS_SERVER(dir);
        struct nfs4_exception exception = { };
        struct nfs4_state *res;
        int status;

        do {
                status = _nfs4_do_open(dir, ctx, flags, sattr, label, opened);
                res = ctx->state;
                trace_nfs4_open_file(ctx, flags, status);
                if (status == 0)
                        break;
…
        return res;
}

static int _nfs4_do_open(struct inode *dir,
                        struct nfs_open_context *ctx,
                        int flags,
                        struct iattr *sattr,
                        struct nfs4_label *label,
                        int *opened)
{
…
        status = nfs4_recover_expired_lease(server);
        if (status != 0)
                goto err_put_state_owner;
…
err_put_state_owner:
        nfs4_put_state_owner(sp);
out_err:
        return status;
}

static int nfs4_recover_expired_lease(struct nfs_server *server)
{
        return nfs4_client_recover_expired_lease(server->nfs_client);
}

nfs4state.c:
int nfs4_client_recover_expired_lease(struct nfs_client *clp)
{
        unsigned int loop;
        int ret;

        for (loop = NFS4_MAX_LOOP_ON_RECOVER; loop != 0; loop--) {
                ret = nfs4_wait_clnt_recover(clp);
                if (ret != 0)
                        break;
                if (!test_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state) &&
                    !test_bit(NFS4CLNT_CHECK_LEASE,&clp->cl_state))
                        break;
                nfs4_schedule_state_manager(clp);
                ret = -EIO;
        }
        return ret;
}

$ grep -r NFS4_MAX_LOOP_ON_RECOVER . | grep define
./nfs4_fs.h:#define NFS4_MAX_LOOP_ON_RECOVER (10)

nfs4renewd.c:
nfs4_renew_state(struct work_struct *work)
{
        const struct nfs4_state_maintenance_ops *ops;
        struct nfs_client *clp =
                container_of(work, struct nfs_client, cl_renewd.work);
        struct rpc_cred *cred;
        long lease;
        unsigned long last, now;
        unsigned renew_flags = 0;

        ops = clp->cl_mvops->state_renewal_ops;
        dprintk("%s: start\n", __func__);

        if (test_bit(NFS_CS_STOP_RENEW, &clp->cl_res_state))
                goto out;

        spin_lock(&clp->cl_lock);
        lease = clp->cl_lease_time;
        last = clp->cl_last_renewal;
        now = jiffies;
        /* Are we close to a lease timeout? */
        if (time_after(now, last + lease/3))
                renew_flags |= NFS4_RENEW_TIMEOUT;
        if (nfs_delegations_present(clp))
                renew_flags |= NFS4_RENEW_DELEGATION_CB;

        if (renew_flags != 0) {
                cred = ops->get_state_renewal_cred_locked(clp);
                spin_unlock(&clp->cl_lock);
                if (cred == NULL) {
                        if (!(renew_flags & NFS4_RENEW_DELEGATION_CB)) {
                                set_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state);
                                goto out;
                        }
                        nfs_expire_all_delegations(clp);
                } else {
                        int ret;

                        /* Queue an asynchronous RENEW. */
                        ret = ops->sched_state_renewal(clp, cred, renew_flags);
                        put_rpccred(cred);
                        switch (ret) {
                        default:
                                goto out_exp;
                        case -EAGAIN:
                        case -ENOMEM:
                                break;
                        }
                }
        } else {
                dprintk("%s: failed to call renewd. Reason: lease not expired \n",
                                __func__);
                spin_unlock(&clp->cl_lock);
        }
        nfs4_schedule_state_renewal(clp);
out_exp:
        nfs_expire_unreferenced_delegations(clp);
out:
        dprintk("%s: done\n", __func__);
}


void
nfs4_schedule_state_renewal(struct nfs_client *clp)
{
        long timeout;

        spin_lock(&clp->cl_lock);
        timeout = (2 * clp->cl_lease_time) / 3 + (long)clp->cl_last_renewal
                - (long)jiffies;
        if (timeout < 5 * HZ)
                timeout = 5 * HZ;
        dprintk("%s: requeueing work. Lease period = %ld\n",
                        __func__, (timeout + HZ - 1) / HZ);
        mod_delayed_work(system_wq, &clp->cl_renewd, timeout);
        set_bit(NFS_CS_RENEWD, &clp->cl_res_state);
        spin_unlock(&clp->cl_lock);
}


/**
 * nfs4_set_lease_period - Sets the lease period on a nfs_client
 *
 * @clp: pointer to nfs_client
 * @lease: new value for lease period
 * @lastrenewed: time at which lease was last renewed
 */
void nfs4_set_lease_period(struct nfs_client *clp,
                unsigned long lease,
                unsigned long lastrenewed)
{
        spin_lock(&clp->cl_lock);
        clp->cl_lease_time = lease;
        clp->cl_last_renewal = lastrenewed;
        spin_unlock(&clp->cl_lock);

        /* Cap maximum reconnect timeout at 1/2 lease period */
        rpc_cap_max_reconnect_timeout(clp->cl_rpcclient, lease >> 1);
}

static int nfs4_do_fsinfo(struct nfs_server *server, struct nfs_fh *fhandle, struct nfs_fsinfo *fsinfo)
{
        struct nfs4_exception exception = { };
        unsigned long now = jiffies;
        int err;

        do {
                err = _nfs4_do_fsinfo(server, fhandle, fsinfo);
                trace_nfs4_fsinfo(server, fhandle, fsinfo->fattr, err);
                if (err == 0) {
                        nfs4_set_lease_period(server->nfs_client,
                                        fsinfo->lease_time * HZ,
                                        now);
                        break;
                }
                err = nfs4_handle_exception(server, err, &exception);
        } while (exception.retry);
        return err;
}

static int _nfs4_do_fsinfo(struct nfs_server *server, struct nfs_fh *fhandle,
                struct nfs_fsinfo *fsinfo)
{
        struct nfs4_fsinfo_arg args = {
                .fh = fhandle,
                .bitmask = server->attr_bitmask,
        };
        struct nfs4_fsinfo_res res = {
                .fsinfo = fsinfo,
        };
        struct rpc_message msg = {
                .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_FSINFO],
                .rpc_argp = &args,
                .rpc_resp = &res,
        };

        return nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
}

Many nfs4 ops reset lease as well, so they assume on successful op server would do the same.
For example:

static void do_renew_lease(struct nfs_client *clp, unsigned long timestamp)
{
        spin_lock(&clp->cl_lock);
        if (time_before(clp->cl_last_renewal,timestamp))
                clp->cl_last_renewal = timestamp;
        spin_unlock(&clp->cl_lock);
}


static void renew_lease(const struct nfs_server *server, unsigned long timestamp)
{
        struct nfs_client *clp = server->nfs_client;

        if (!nfs4_has_session(clp))
                do_renew_lease(clp, timestamp);
}
}

static void nfs4_open_confirm_done(struct rpc_task *task, void *calldata)
{
        struct nfs4_opendata *data = calldata;

        nfs40_sequence_done(task, &data->c_res.seq_res);

        data->rpc_status = task->tk_status;
        if (data->rpc_status == 0) {
                nfs4_stateid_copy(&data->o_res.stateid, &data->c_res.stateid);
                nfs_confirm_seqid(&data->owner->so_seqid, 0);
                renew_lease(data->o_res.server, data->timestamp);
                data->rpc_done = 1;
        }
}

static void nfs4_open_done(struct rpc_task *task, void *calldata)
{
        struct nfs4_opendata *data = calldata;

        data->rpc_status = task->tk_status;

        if (!nfs4_sequence_process(task, &data->o_res.seq_res))
                return;

        if (task->tk_status == 0) {
                if (data->o_res.f_attr->valid & NFS_ATTR_FATTR_TYPE) {
                        switch (data->o_res.f_attr->mode & S_IFMT) {
                        case S_IFREG:
                                break;
                        case S_IFLNK:
                                data->rpc_status = -ELOOP;
                                break;
                        case S_IFDIR:
                                data->rpc_status = -EISDIR;
                                break;
                        default:
                                data->rpc_status = -ENOTDIR;
                        }
                }
                renew_lease(data->o_res.server, data->timestamp);
                if (!(data->o_res.rflags & NFS4_OPEN_RESULT_CONFIRM))
                        nfs_confirm_seqid(&data->owner->so_seqid, 0);
        }
        data->rpc_done = 1;
}

Let’s see what is calling nfs4_do_fsinfo():

# cat ./nfs4_do_fsinfo.stp
#!/usr/bin/env stap


/*
 module("nfsv4").function("nfs4_do_fsinfo@fs/nfs/nfs4proc.c:4504") $server:struct nfs_server* $fhandle:struct nfs_fh* $fsinfo:struct nfs_fsinfo* $exception:struct nfs4_exception
*/


probe  module("nfsv4").function("nfs4_do_fsinfo") {

        printf("%s %s proc: %s[%d] uid: %d\n", ctime(gettimeofday_s()), probefunc(), execname(), pid(), uid())
        print_backtrace()
        exit()
}

# ./nfs4_do_fsinfo.stp

WARNING: Missing unwind data for a module, rerun with 'stap -d nfs'
Thu Nov 21 16:39:53 2019 nfs4_do_fsinfo proc: ls[18701] uid: 10006
 0xffffffffc05de260 : nfs4_do_fsinfo+0x0/0x120 [nfsv4]
 0xffffffffc05de3b1 : nfs4_proc_fsinfo+0x31/0x60 [nfsv4]
 0xffffffffc05924e0 [nfs]
 0xffffffffc0592b43 [nfs]
 0xffffffffc059f098 [nfs]
 0xffffffff9104d1fe
 0xffffffff9106b377
 0xffffffff9106b454
 0xffffffffc05ab014 [nfs]
 0xffffffffc05fb740 : nfs4_submount+0x90/0x830 [nfsv4]
 0xffffffffc05ab21a [nfs]
 0xffffffff91053a69
 0xffffffff910549ad
 0xffffffff91058483
 0xffffffff9105878b
 0xffffffff91058cea
 0xffffffff9105954b
 0xffffffff9105b497
 0xffffffff9105b501
 0xffffffff9104e1e3


# ./nfs4_do_fsinfo.stp
WARNING: Missing unwind data for a module, rerun with 'stap -d nfs'
Thu Nov 21 17:10:05 2019 nfs4_do_fsinfo proc: ls[29463] uid: 10006
 0xffffffffc05de260 : nfs4_do_fsinfo+0x0/0x120 [nfsv4]
 0xffffffffc05de3b1 : nfs4_proc_fsinfo+0x31/0x60 [nfsv4]
 0xffffffffc05924e0 [nfs]
 0xffffffffc0592b43 [nfs]
 0xffffffffc059f098 [nfs]
 0xffffffff9104d1fe
 0xffffffff9106b377
 0xffffffff9106b454
 0xffffffffc05ab014 [nfs]
 0xffffffffc05fb740 : nfs4_submount+0x90/0x830 [nfsv4]
 0xffffffffc05ab21a [nfs]
 0xffffffff91053a69
 0xffffffff910549ad
 0xffffffff91058483
 0xffffffff9105878b
 0xffffffff91058cea
 0xffffffff9105954b
 0xffffffff9105b497
 0xffffffff9105b501
 0xffffffff9104e1e3

Let’s get some more info:

# cat ./nfs4_do_fsinfo.stp
#!/usr/bin/env stap


/*
 module("nfsv4").function("nfs4_do_fsinfo@fs/nfs/nfs4proc.c:4504") $server:struct nfs_server* $fhandle:struct nfs_fh* $fsinfo:struct nfs_fsinfo* $exception:struct nfs4_exception
*/



probe  module("nfsv4").function("nfs4_do_fsinfo") {

        printf("%s %s proc: %s[%d] ppid: %d args: %s cwd: %s uid: %d\n", ctime(gettimeofday_s()), probefunc(), execname(), pid(),
               ppid(), cmdline_str(), fullpath_struct_path(task_cwd_path(task_current())), uid())
        print_backtrace()
        exit()
}

# ./nfs4_do_fsinfo.stp
WARNING: Missing unwind data for a module, rerun with 'stap -d nfs'
Thu Nov 21 17:40:05 2019 nfs4_do_fsinfo proc: ls[9275] ppid: 9273 args: ls /mnt/.../ cwd: /... uid: 10006
 0xffffffffc05de260 : nfs4_do_fsinfo+0x0/0x120 [nfsv4]
 0xffffffffc05de3b1 : nfs4_proc_fsinfo+0x31/0x60 [nfsv4]
 0xffffffffc05924e0 [nfs]
 0xffffffffc0592b43 [nfs]
 0xffffffffc059f098 [nfs]
 0xffffffff9104d1fe
 0xffffffff9106b377
 0xffffffff9106b454
 0xffffffffc05ab014 [nfs]
 0xffffffffc05fb740 : nfs4_submount+0x90/0x830 [nfsv4]
 0xffffffffc05ab21a [nfs]
 0xffffffff91053a69
 0xffffffff910549ad
 0xffffffff91058483
 0xffffffff9105878b
 0xffffffff91058cea
 0xffffffff9105954b
 0xffffffff9105b497
 0xffffffff9105b501
 0xffffffff9104e1e3

Notice that fsifno is actually a set of GETATTR operations in nfsv4:

/*
 * FSINFO request
 */
static void nfs4_xdr_enc_fsinfo(struct rpc_rqst *req, struct xdr_stream *xdr,
                                struct nfs4_fsinfo_arg *args)
{
        struct compound_hdr hdr = {
                .minorversion = nfs4_xdr_minorversion(&args->seq_args),
        };

        encode_compound_hdr(xdr, req, &hdr);
        encode_sequence(xdr, &args->seq_args, &hdr);
        encode_putfh(xdr, args->fh, &hdr);
        encode_fsinfo(xdr, args->bitmask, &hdr);
        encode_nops(&hdr);
}

static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
{
        encode_getattr_three(xdr,
                             bitmask[0] & nfs4_fsinfo_bitmap[0],
                             bitmask[1] & nfs4_fsinfo_bitmap[1],
                             bitmask[2] & nfs4_fsinfo_bitmap[2],
                             hdr);
}

Let's double check that the nfs4_do_fsinfo() is caused by vfs/fstat() call or similar by ls:

# umount /mnt/...
$ ./sh/checkXXXMounts.sh
Checking /mnt/...
...

# stap -g -d/bin/ls --ldd --all-modules ./nfs4_do_fsinfo.stp
WARNING: missing unwind/symbol data for module 'stap_6d2b184a53eae72ccbc95adec92f2d3c_25253'
Fri Nov 22 10:32:32 2019 nfs4_do_fsinfo proc: ls[27494] ppid: 27492 args: ls /mnt/... cwd: /... uid: 10006
 0xffffffffc05de260 : nfs4_do_fsinfo+0x0/0x120 [nfsv4]
 0xffffffffc05de3b1 : nfs4_proc_fsinfo+0x31/0x60 [nfsv4]
 0xffffffffc05924e0 : nfs_probe_fsinfo+0xc0/0x660 [nfs]
 0xffffffffc0592b43 : nfs_clone_server+0xc3/0x1f0 [nfs]
 0xffffffffc059f098 : nfs_xdev_mount+0x98/0x130 [nfs]
 0xffffffff9104d1fe : mount_fs+0x3e/0x1b0 [kernel]
 0xffffffff9106b377 : vfs_kern_mount+0x67/0x110 [kernel]
 0xffffffff9106b454 : vfs_submount+0x34/0x40 [kernel]
 0xffffffffc05ab014 : nfs_do_submount+0xb4/0x160 [nfs]
 0xffffffffc05fb740 : nfs4_submount+0x90/0x830 [nfsv4]
 0xffffffffc05ab21a : nfs_d_automount+0xca/0x1c0 [nfs]
 0xffffffff91053a69 : follow_managed+0x149/0x310 [kernel]
 0xffffffff910549ad : lookup_fast+0x12d/0x230 [kernel]
 0xffffffff91058483 : link_path_walk+0x1b3/0x8b0 [kernel]
 0xffffffff9105878b : link_path_walk+0x4bb/0x8b0 [kernel]
 0xffffffff91058cea : path_lookupat+0x7a/0x8b0 [kernel]
 0xffffffff9105954b : filename_lookup+0x2b/0xc0 [kernel]
 0xffffffff9105b497 : user_path_at_empty+0x67/0xc0 [kernel]
 0xffffffff9105b501 : user_path_at+0x11/0x20 [kernel]
 0xffffffff9104e1e3 : vfs_fstatat+0x63/0xc0 [kernel]
-------
 0x7fd54c693275 : __xstat64+0x15/0x50 [/usr/lib64/libc-2.17.so]
 0x408841 : gobble_file.constprop.50+0x8b1/0xeb0 [/usr/bin/ls]
 0x402e41 : main+0x531/0x1a14 [/usr/bin/ls]
 0x7fd54c5c6505 : __libc_start_main+0xf5/0x1c0 [/usr/lib64/libc-2.17.so]
 0x40434d : _start+0x29/0x2c [/usr/bin/ls]
-------

Yes, it is. So the
btw: just running ls is enough to trigger it after data is no longer cached or better unmount it first:

# umount /mnt/...
# ls -l /mnt/... >/dev/null

# stap -g -d/bin/ls --ldd --all-modules ./nfs4_do_fsinfo.stp
Fri Nov 22 10:37:17 2019 nfs4_do_fsinfo proc: ls[29995] ppid: 13019 args: ls --color=auto -l /mnt/... cwd: /... uid: 0
 0xffffffffc05de260 : nfs4_do_fsinfo+0x0/0x120 [nfsv4]
 0xffffffffc05de3b1 : nfs4_proc_fsinfo+0x31/0x60 [nfsv4]
 0xffffffffc05924e0 : nfs_probe_fsinfo+0xc0/0x660 [nfs]
 0xffffffffc0592b43 : nfs_clone_server+0xc3/0x1f0 [nfs]
 0xffffffffc059f098 : nfs_xdev_mount+0x98/0x130 [nfs]
 0xffffffff9104d1fe : mount_fs+0x3e/0x1b0 [kernel]
 0xffffffff9106b377 : vfs_kern_mount+0x67/0x110 [kernel]
 0xffffffff9106b454 : vfs_submount+0x34/0x40 [kernel]
 0xffffffffc05ab014 : nfs_do_submount+0xb4/0x160 [nfs]
 0xffffffffc05fb740 : nfs4_submount+0x90/0x830 [nfsv4]
 0xffffffffc05ab21a : nfs_d_automount+0xca/0x1c0 [nfs]
 0xffffffff91053a69 : follow_managed+0x149/0x310 [kernel]
 0xffffffff910549ad : lookup_fast+0x12d/0x230 [kernel]
 0xffffffff91058ddd : path_lookupat+0x16d/0x8b0 [kernel]
 0xffffffff9105954b : filename_lookup+0x2b/0xc0 [kernel]
 0xffffffff9105b497 : user_path_at_empty+0x67/0xc0 [kernel]
 0xffffffff9105b501 : user_path_at+0x11/0x20 [kernel]
 0xffffffff9104e1e3 : vfs_fstatat+0x63/0xc0 [kernel]
 0xffffffff9104e601 : SYSC_newlstat+0x31/0x60 [kernel]

So what happens here is that Linux will unmount cross-mountpoints after /proc/sys/fs/nfs/nfs_mountpoint_timeout which on this server has the default value of 500. Next time anything triggers mounting a cross-mouint linux will send fsinfo() call (nfs4_do_fsinfo()).
For now, assuming the nfs4_do_fsinfo() only gets called on mounting cross-mount (not necesarilly true), to test a possible workaround let's modify linux so it wouldn't unmount sub-mounts for 7 days - on linux client:

# bc
60*60*24*7
604800

# echo 604800 > /proc/sys/fs/nfs/nfs_mountpoint_timeout

# date
Fri 22 Nov 12:23:27 GMT 2019

Also running:

# stap -g --ldd --all-modules --suppress-handler-errors -o /var/tmp/nfs4_do_fsinfo.log ./nfs4_do_fsinfo.stp  &
[6] 10295

I've managed to reproduce the issue.

# cat ./nfs4_renew_state.stp
#!/bin/stap


/*
 * module("nfsv4").function("nfs4_renew_state@fs/nfs/nfs4renewd.c:55") $work:struct work_struct* $clp:struct nfs_client* $renew_flags:unsigned int $__func__:char[] const
*/

probe  module("nfsv4").function("nfs4_renew_state") {

        printf("%s %s least_time: %d secs_to_renew: %d server: %s\n", ctime(gettimeofday_s()), probefunc(), 
               $clp->cl_lease_time/HZ(), (jiffies()-$clp->cl_last_renewal)/HZ(), kernel_string($clp->cl_hostname))
}

# ./nfs4_renew_state.stp | grep xxx
Fri Nov 22 15:40:41 2019 nfs4_renew_state least_time: 90 secs_to_renew: 60 server: xxx
Fri Nov 22 15:41:42 2019 nfs4_renew_state least_time: 90 secs_to_renew: 5 server: xxx
Fri Nov 22 15:42:36 2019 nfs4_renew_state least_time: 90 secs_to_renew: 60 server: xxx

Few seconds before 15:40:41+60s I did:

# umount /mnt/...; ls -l /mnt/.../
ls: cannot open directory /mnt/.../: Permission denied

So now Linux is not going to send RENEW and assume its lease got extended.
At 15:40:41+90s solaris server will assume the client's lease expired, so if we call open between 15:42:11 - 15:42:36 then open should fail:

$ date;strace -e trace=open cat /mnt/.../xxx >/dev/null
Fri 22 Nov 15:42:13 GMT 2019
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/mnt/.../xxx", O_RDONLY) = -1 EIO (Input/output error)
...

$ date;strace -e trace=open cat /mnt/.../xxx >/dev/null
Fri 22 Nov 15:42:38 GMT 2019
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/mnt/.../xxx", O_RDONLY) = 3
+++ exited with 0 +++

and it did fail, while when re-run again after linux sent renew then it worked.
Let's confirm solaris did respond with 10011:

# ggrep "Nov 22 .*-done.*10.210.16.12" /var/tmp/nfs4-renew.log | grep "ret: 10011" | tail
...
2019 Nov 22 15:42:36 op-renew-done noi_xid: 2258385869 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds
2019 Nov 22 15:42:36 op-renew-done noi_xid: 2275163085 10.210.16.12 10.210.6.5 60001:60001 <unknown> ret: 10011 took: 0 seconds

# grep 15:42:13 /var/tmp/nfs4_all_ops.log | grep 10011
2019 Nov 22 15:42:13 op-open-done 10.210.16.12 10.210.6.5 took: 0 ret: 10011

It did.

I've also managed to reproduce the same issue against Linux NFS server.

# cat ./nfs4_renew_state.stp
#!/bin/stap


/*
 * module("nfsv4").function("nfs4_renew_state@fs/nfs/nfs4renewd.c:55") $work:struct work_struct* $clp:struct nfs_client* $renew_flags:unsigned int $__func__:char[] const
*/

probe  module("nfsv4").function("nfs4_renew_state") {

        printf("%s %s least_time: %d secs_to_renew: %d server: %s\n", ctime(gettimeofday_s()), probefunc(), 
               $clp->cl_lease_time/HZ(), (jiffies()-$clp->cl_last_renewal)/HZ(), kernel_string($clp->cl_hostname))
}

# ./nfs4_renew_state.stp | grep xxx
Mon Nov 25 13:49:36 2019 nfs4_renew_state least_time: 90 secs_to_renew: 40 server: xxx
Mon Nov 25 13:50:36 2019 nfs4_renew_state least_time: 90 secs_to_renew: 60 server: xxx
Mon Nov 25 13:51:36 2019 nfs4_renew_state least_time: 90 secs_to_renew: 3 server: xxx
Mon Nov 25 13:52:33 2019 nfs4_renew_state least_time: 90 secs_to_renew: 60 server: xxx
Mon Nov 25 13:52:38 2019 nfs4_renew_state least_time: 90 secs_to_renew: 65 server: xxx
Mon Nov 25 13:53:38 2019 nfs4_renew_state least_time: 90 secs_to_renew: 60 server: xxx
^C

At 13:51:32 (4 seconds before client was to send RENEW operation) I run:

# date; umount /mnt/...; ls -l /mnt/.../
Mon 25 Nov 13:51:32 GMT 2019
total 7
-rw-r--r-- 1 root root 5 Nov 25 13:48 test_f1

Then at 13:52:08, 2 seconds after the client's lease expired from server point of view (13:50:36 + 90s = 13:52:06), I run:

$ date;strace -e trace=open cat /mnt/.../test_f1 >/dev/null
Mon 25 Nov 13:52:08 GMT 2019
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/mnt/.../test_f1", O_RDONLY) = -1 EIO (Input/output error)
...

Then, at 13:52:33 (13:51:36 + 57s), the client send RENEW operation, and had to recover via SETCLIENTID, etc. and now open() works fine:

$ date;strace -e trace=open cat /mnt/.../test_f1 >/dev/null
Mon 25 Nov 13:53:39 GMT 2019
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/mnt/.../test_f1", O_RDONLY) = 3
+++ exited with 0 +++

The Linux nfs server behaves the same as Solaris nfs server.
The linux is CentOS 7.4 running 3.10.0-693.11.6.el7.x86_64.
Also reproduced against Solaris 11.4.15.5.0 NFS server.

A workaround on Linux nfs client hosts is to increase sysctl/fs.nfs.nfs_mountpoint_timeout from 500 (default) to max allowed value: 2147483.

I couldn't reproduce the issue against NetApp filers though, some clarification from NetApp support:

Hi Robert,
 
I have gone through this case and your request for explanation on why ONTAP is not throwing “NFS4ERR_EXPIRED” when the lease is expired while other servers does that.
 
Why the Client did not send the RENEW call is client’s behavior and ONTAP cannot control that.
Now on why ONTAP is not throwing “NFS4ERR_EXPIRED” even after lease is expired (30 secs), I believe the RFC has the answer for that.
 
This section explains the behavior of other servers you have tested.
 
https://tools.ietf.org/html/rfc7530#section-9.6.3
9.6.3.  Network Partitions and Recovery
If this occurs, the server may cancel
the lease and free all locks held for the client.
 
The following section explains the ONTAP’s behavior.
 
https://tools.ietf.org/html/rfc7530#section-9.6.3.1
9.6.3.1.  Courtesy Locks
As a courtesy to the client or as an optimization, the server may
continue to hold locks, including delegations, on behalf of a client
for which recent communication has extended beyond the lease period,
delaying the cancellation of the lease.
 
This section clarifies that both the behaviors can be implemented.
 
https://tools.ietf.org/html/rfc7530#section-9.6.3.2
9.6.3.2.  Lease Cancellation
 
As a result of lease expiration, leases may be canceled, either
immediately upon expiration or subsequently, depending on the
occurrence of a conflicting lock or extension of the period of
partition beyond what the server will tolerate.
 
I believe ONTAP is behaving according to the RFC and in more optimized way.
In that case, please let me know if you need any further explanation from our side on the same.
...

Hey Robert,
 
We tested in the lab thoroughly to check for how long we keep the courtesy locks.
 
The courtesy locks will be kept for 2*lease time, if there is no conflicting request from any other client for the same file.
Please note that it might take some more time (40-50 secs) to clear those states from memory as the thread runs periodically to clean that.
 
Yes, they are tunable. As the time depends on lease-time, you can tweak that for a particular NFS server to control the courtesy lock time.
...

Knowing now more how ONTAP behaves I re-created the issue as well, although the behavior is different.
It is easier to cause do_fsinfo() every 10s as it will permanently stop linux client from sending RENEW operation.

# while [ 1 ] ; do date; umount /mnt/...; ls -l /mnt/.../; sleep 10; done

Few minutes later I run:

$ date; strace -e trace=open cat /mnt/.../f1
Fri  6 Dec 10:52:53 GMT 2019
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/mnt/.../f1", O_RDONLY) = 3
test
+++ exited with 0 +++

Notice that the open did not fail. Network dump:

3.656314120 10.210.16.12 -> 10.210.196.134 NFS 322 V4 Call OPEN DH: 0x6d8f1235/f1
3.861180170 10.210.196.134 -> 10.210.16.12 NFS 150 V4 Reply (Call In 34) OPEN Status: NFS4ERR_STALE_CLIENTID
3.861352838 10.210.16.12 -> 10.210.196.134 NFS 230 V4 Call RENEW CID: 0x5b6f
4.066210651 10.210.196.134 -> 10.210.16.12 NFS 182 V4 Reply (Call In 36) RENEW Status: NFS4ERR_STALE_CLIENTID
4.066357237 10.210.16.12 -> 10.210.196.134 NFS 338 V4 Call SETCLIENTID
4.271134903 10.210.196.134 -> 10.210.16.12 NFS 198 V4 Reply (Call In 38) SETCLIENTID
4.271256568 10.210.16.12 -> 10.210.196.134 NFS 238 V4 Call SETCLIENTID_CONFIRM
4.476085500 10.210.196.134 -> 10.210.16.12 NFS 182 V4 Reply (Call In 40) SETCLIENTID_CONFIRM
4.476296996 10.210.16.12 -> 10.210.196.134 NFS 322 V4 Call OPEN DH: 0x6d8f1235/f1
4.681334155 10.210.196.134 -> 10.210.16.12 NFS 470 V4 Reply (Call In 44) OPEN StateID: 0x900a
4.683485544 10.210.16.12 -> 10.210.196.134 NFS 258 V4 Call READ StateID: 0x900a Offset: 0 Len: 5
4.888349784 10.210.196.134 -> 10.210.16.12 NFS 166 V4 Reply (Call In 50) READ
4.888516437 10.210.16.12 -> 10.210.196.134 NFS 266 V4 Call CLOSE StateID: 0x900a
5.093391513 10.210.196.134 -> 10.210.16.12 NFS 230 V4 Reply (Call In 54) CLOSE

So ONTAP responds with NFS4ERR_STALE_CLIENTID instead of NFS4ERR_EXPIRED.
I think this is handled by fs/nfs/nfs4proc.c: nfs4_do_handle_exception()
Let's confirm it:

# cat ./nfs4_do_handle_exception.stp
#!/bin/stap


// /bin/stap -g -d /bin/ls --ldd --all-modules


/*
 * module("nfsv4").function("nfs4_do_handle_exception@fs/nfs/nfs4proc.c:383") $server:struct nfs_server* $errorcode:int $exception:struct nfs4_exception* $__func__:char[] const
*/


probe  module("nfsv4").function("nfs4_do_handle_exception") {

        printf("%s %s proc: %s[%d] ppid: %d args: %s cwd: %s uid: %d\n", ctime(gettimeofday_s()), probefunc(), execname(), 
               pid(), ppid(), cmdline_str(), fullpath_struct_path(task_cwd_path(task_current())), uid())
        print_backtrace()
        printf("-------\n")
        //print_ubacktrace()
        //printf("-------\n")
}

# ./nfs4_do_handle_exception.stp --all-modules -o /var/tmp/nfs4_do_handle_exception.out &

$ date; strace -e trace=open cat /mnt/.../f1
Fri  6 Dec 11:28:32 GMT 2019
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/mnt/.../f1", O_RDONLY) = 3
test
+++ exited with 0 +++

11:28:32.419449800 10.210.16.12 -> 10.210.196.134 NFS 250 V4 Call ACCESS FH: 0x49b41796, [Check: RD LU MD XT DL]
11:28:32.624732095 10.210.196.134 -> 10.210.16.12 NFS 222 V4 Reply (Call In 24) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU]
11:28:32.625066507 10.210.16.12 -> 10.210.196.134 NFS 250 V4 Call ACCESS FH: 0x951a2495, [Check: RD LU MD XT DL]
11:28:32.830312285 10.210.196.134 -> 10.210.16.12 NFS 222 V4 Reply (Call In 27) ACCESS, [Allowed: RD LU MD XT DL]
11:28:32.830765027 10.210.16.12 -> 10.210.196.134 NFS 250 V4 Call ACCESS FH: 0x6d8f1235, [Check: RD LU MD XT DL]
11:28:33.046220915 10.210.196.134 -> 10.210.16.12 NFS 222 V4 Reply (Call In 29) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU]
11:28:33.046554768 10.210.16.12 -> 10.210.196.134 NFS 322 V4 Call OPEN DH: 0x6d8f1235/f1
11:28:33.251846761 10.210.196.134 -> 10.210.16.12 NFS 150 V4 Reply (Call In 31) OPEN Status: NFS4ERR_STALE_CLIENTID
11:28:33.252467673 10.210.16.12 -> 10.210.196.134 NFS 230 V4 Call RENEW CID: 0x3e59
11:28:33.457711654 10.210.196.134 -> 10.210.16.12 NFS 182 V4 Reply (Call In 33) RENEW Status: NFS4ERR_STALE_CLIENTID
11:28:33.457873262 10.210.16.12 -> 10.210.196.134 NFS 338 V4 Call SETCLIENTID
11:28:33.663124308 10.210.196.134 -> 10.210.16.12 NFS 198 V4 Reply (Call In 35) SETCLIENTID
11:28:33.663262733 10.210.16.12 -> 10.210.196.134 NFS 238 V4 Call SETCLIENTID_CONFIRM
11:28:33.868584648 10.210.196.134 -> 10.210.16.12 NFS 182 V4 Reply (Call In 37) SETCLIENTID_CONFIRM
11:28:33.868821676 10.210.16.12 -> 10.210.196.134 NFS 322 V4 Call OPEN DH: 0x6d8f1235/f1
11:28:34.074306487 10.210.196.134 -> 10.210.16.12 NFS 470 V4 Reply (Call In 41) OPEN StateID: 0x985c
11:28:34.074508377 10.210.16.12 -> 10.210.196.134 NFS 258 V4 Call READ StateID: 0x985c Offset: 0 Len: 5
11:28:34.279740081 10.210.196.134 -> 10.210.16.12 NFS 166 V4 Reply (Call In 47) READ
11:28:34.279909666 10.210.16.12 -> 10.210.196.134 NFS 266 V4 Call CLOSE StateID: 0x985c
11:28:34.485207915 10.210.196.134 -> 10.210.16.12 NFS 230 V4 Reply (Call In 51) CLOSE

# less /var/tmp/nfs4_do_handle_exception.out
...
-------
Fri Dec  6 11:28:32 2019 nfs4_do_handle_exception proc: cat[5231] ppid: 5229 args: cat /mnt/.../f1 cwd: /...w uid: 14100
 0xffffffffc05d5a80 : nfs4_do_handle_exception+0x0/0x340 [nfsv4]
 0xffffffffc05dcf0c : nfs4_handle_exception+0x1c/0x160 [nfsv4]
 0xffffffffc05df973 : nfs4_proc_access+0x83/0xf0 [nfsv4]
 0xffffffffc0597adb : nfs_do_access+0x1bb/0x3b0 [nfs]
 0xffffffffc0597ebe : nfs_permission+0x1be/0x220 [nfs]
 0xffffffff91054471 : __inode_permission+0x71/0xd0 [kernel]
 0xffffffff910544e8 : inode_permission+0x18/0x50 [kernel]
 0xffffffff9105854e : link_path_walk+0x27e/0x8b0 [kernel]
 0xffffffff91059c3e : path_openat+0xae/0x5a0 [kernel]
 0xffffffff9105b5cd : do_filp_open+0x4d/0xb0 [kernel]
 0xffffffff91047524 : do_sys_open+0x124/0x220 [kernel]
 0xffffffff9104763e : sys_open+0x1e/0x20 [kernel]
 0xffffffff9158d172 : tracesys+0xa6/0xcc [kernel]
 0x7f6b0d228710
-------
Fri Dec  6 11:28:32 2019 nfs4_do_handle_exception proc: cat[5231] ppid: 5229 args: cat /mnt/.../f1 cwd: /... uid: 14100
 0xffffffffc05d5a80 : nfs4_do_handle_exception+0x0/0x340 [nfsv4]
 0xffffffffc05dcf0c : nfs4_handle_exception+0x1c/0x160 [nfsv4]
 0xffffffffc05df973 : nfs4_proc_access+0x83/0xf0 [nfsv4]
 0xffffffffc0597adb : nfs_do_access+0x1bb/0x3b0 [nfs]
 0xffffffffc0597ebe : nfs_permission+0x1be/0x220 [nfs]
 0xffffffff91054471 : __inode_permission+0x71/0xd0 [kernel]
 0xffffffff910544e8 : inode_permission+0x18/0x50 [kernel]
 0xffffffff9105854e : link_path_walk+0x27e/0x8b0 [kernel]
 0xffffffff91059c3e : path_openat+0xae/0x5a0 [kernel]
 0xffffffff9105b5cd : do_filp_open+0x4d/0xb0 [kernel]
 0xffffffff91047524 : do_sys_open+0x124/0x220 [kernel]
 0xffffffff9104763e : sys_open+0x1e/0x20 [kernel]
 0xffffffff9158d172 : tracesys+0xa6/0xcc [kernel]
 0x7f6b0d228710
-------
Fri Dec  6 11:28:33 2019 nfs4_do_handle_exception proc: cat[5231] ppid: 5229 args: cat /mnt/.../f1 cwd: /... uid: 14100
 0xffffffffc05d5a80 : nfs4_do_handle_exception+0x0/0x340 [nfsv4]
 0xffffffffc05dcf0c : nfs4_handle_exception+0x1c/0x160 [nfsv4]
 0xffffffffc05df973 : nfs4_proc_access+0x83/0xf0 [nfsv4]
 0xffffffffc0597adb : nfs_do_access+0x1bb/0x3b0 [nfs]
 0xffffffffc0597ebe : nfs_permission+0x1be/0x220 [nfs]
 0xffffffff91054471 : __inode_permission+0x71/0xd0 [kernel]
 0xffffffff910544e8 : inode_permission+0x18/0x50 [kernel]
 0xffffffff9105854e : link_path_walk+0x27e/0x8b0 [kernel]
 0xffffffff91059c3e : path_openat+0xae/0x5a0 [kernel]
 0xffffffff9105b5cd : do_filp_open+0x4d/0xb0 [kernel]
 0xffffffff91047524 : do_sys_open+0x124/0x220 [kernel]
 0xffffffff9104763e : sys_open+0x1e/0x20 [kernel]
 0xffffffff9158d172 : tracesys+0xa6/0xcc [kernel]
 0x7f6b0d228710
-------
Fri Dec  6 11:28:33 2019 nfs4_do_handle_exception proc: cat[5231] ppid: 5229 args: cat /mnt/.../f1 cwd: /... uid: 14100
 0xffffffffc05d5a80 : nfs4_do_handle_exception+0x0/0x340 [nfsv4]
 0xffffffffc05dcf0c : nfs4_handle_exception+0x1c/0x160 [nfsv4]
 0xffffffffc05e0296 : nfs4_do_open+0x2c6/0x630 [nfsv4]
 0xffffffffc05e06f7 : nfs4_atomic_open+0xf7/0x110 [nfsv4]
 0xffffffffc0597317 : nfs_atomic_open+0x197/0x3a0 [nfs]
 0xffffffff91057623 : do_last+0xa53/0x1290 [kernel]
 0xffffffff91059c5d : path_openat+0xcd/0x5a0 [kernel]
 0xffffffff9105b5cd : do_filp_open+0x4d/0xb0 [kernel]
 0xffffffff91047524 : do_sys_open+0x124/0x220 [kernel]
 0xffffffff9104763e : sys_open+0x1e/0x20 [kernel]
 0xffffffff9158d172 : tracesys+0xa6/0xcc [kernel]
 0x7f6b0d228710
-------

The nfs4_do_open() function does end up calling nfs4_do_handle_exception(), the relevant code is:

static int nfs4_do_handle_exception(struct nfs_server *server,
                int errorcode, struct nfs4_exception *exception)
{
...
                case -NFS4ERR_EXPIRED:
                case -NFS4ERR_BAD_STATEID:
                        if (inode != NULL && stateid != NULL) {
                                nfs_inode_find_state_and_recover(inode,
                                                stateid);
                                goto wait_on_recovery;
                        }
...
                case -NFS4ERR_STALE_STATEID:
                case -NFS4ERR_STALE_CLIENTID:
                        nfs4_schedule_lease_recovery(clp);
                        goto wait_on_recovery;
...

In another function they handle both of them the same though:

@@ -5960,6 +5962,8 @@ static void nfs4_release_lockowner_done(struct rpc_task *task, void *calldata)
                break;
        case -NFS4ERR_STALE_CLIENTID:
        case -NFS4ERR_EXPIRED:
+               nfs4_schedule_lease_recovery(server->nfs_client);
+               break;

Not sure why it is handled differently in nfs4_do_handle_exception().
This was changed by:

commit 272289a3df7297fd798a8ddf652c6d29f675de28
Author: Trond Myklebust
Date:   Thu Sep 22 13:39:15 2016 -0400

    NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid

    If we're not yet sure that all state has expired or been revoked, we
    should try to do a minimal recovery on just the one stateid.

    Signed-off-by: Trond Myklebust
    Tested-by: Oleg Drokin
    Signed-off-by: Anna Schumaker

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 389089e..409b9fa 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -397,13 +397,23 @@ static int nfs4_do_handle_exception(struct nfs_server *server,
        exception->delay = 0;
        exception->recovering = 0;
        exception->retry = 0;
+
+       if (stateid == NULL && state != NULL)
+               stateid = &state->stateid;

The new nfs v4 client behavior in Linux was introduced by https://github.com/torvalds/linux/commit/83ca7f5ab31f57506bb35482e4b0426be653196a. This is the first time the linux client assumes implicit lease renewal when it calls fsinfio() (getattrs for mounts). The code was later refactored a little bit but latest upstream version still does it (at the time of debugging the issue).

The below two patches I contributed to Linux kernel are now also present in CentOS 8.3 with kernel 4.18.0-240.1.1.el8_3 and newer. https://github.com/torvalds/linux/commit/924491f2e476f7234d722b24171a4daff61bbe13

NFSv4: try lease recovery on NFS4ERR_EXPIRED

Currently, if an nfs server returns NFS4ERR_EXPIRED to open(),
we return EIO to applications without even trying to recover.

Fixes: 272289a ("NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid")
Signed-off-by: Robert Milkowski
Reviewed-by: Trond Myklebust
Signed-off-by: Anna Schumaker

https://github.com/torvalds/linux/commit/7dc2993a9e51dd2eee955944efec65bef90265b7

NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals

Currently, each time nfs4_do_fsinfo() is called it will do an implicit
NFS4 lease renewal, which is not compliant with the NFS4 specification.
This can result in a lease being expired by an NFS server.

Commit 83ca7f5 ("NFS: Avoid PUTROOTFH when managing leases")
introduced implicit client lease renewal in nfs4_do_fsinfo(),
which can result in the NFSv4.0 lease to expire on a server side,
and servers returning NFS4ERR_EXPIRED or NFS4ERR_STALE_CLIENTID.

This can easily be reproduced by frequently unmounting a sub-mount,
then stat'ing it to get it mounted again, which will delay or even
completely prevent client from sending RENEW operations if no other
NFS operations are issued. Eventually nfs server will expire client's
lease and return an error on file access or next RENEW.

This can also happen when a sub-mount is automatically unmounted
due to inactivity (after nfs_mountpoint_expiry_timeout), then it is
mounted again via stat(). This can result in a short window during
which client's lease will expire on a server but not on a client.
This specific case was observed on production systems.

This patch removes the implicit lease renewal from nfs4_do_fsinfo().

Fixes: 83ca7f5 ("NFS: Avoid PUTROOTFH when managing leases")
Signed-off-by: Robert Milkowski
Signed-off-by: Anna Schumaker

2 comments:

Anonymous said...: As of the Linux v6.0 kernel, it’s NFS server also acts courteously to clients that do not renew their lease on time. The server holds unconflicted open and lock state as long as there is adequate memory available.; 04 January, 2023 04:25
Robert said...: Educative post, thanks;); 07 January, 2023 10:09