Wednesday, December 28, 2022

NFSv4 + Kerberos: 4 minutes of slowness

It’s a blog entry I started writing over 3 years ago and never managed to actually finish/publish it, until now.
I haven’t checked in a while if latest Linux kernel is still affected by the described behaviour here, but I suspect it is.

An application running on CentOS 6.9 - it often copies files (4-6GB in size on average) from a local SSD based cache to NFSv4 (+krb) mount. Developers complained that twice a day the copy takes significantly longer to complete without any obvious reason. They even added logging the their code to log all occurrences when a copy of a 64MB chunk takes longer than 10,000ms.

After some quick testing I determined that reads are fast, but writes sometimes would be very slow, despite them being buffered writes on a server with lots of free memory. No obvious bottlenecks on network, no packet drops, etc.

Initially I suspected write back throttling, but quickly realised this was not the culprit.

After tracing the java code I determined that the application ends up calling sendfile() with 64MB chunks. Nothing unusual here, but just in case there is some sendfile() specific behaviour here, I tried to reproduce the issue with read()/write() loop with 64MB chunks and hit the same issue. The issue also occurred with  smaller chunk sizes.

A test version of the app (with only the relevant/affected file copy code):

$ strace -tT -v -f -o /tmp/a java CopyTest /.../f1 /mnt/.../tmp/f1
Copying a chunk of 67076096 bytes from position 0 in /…/f1 to /…/tmp/f1 took 28005 ms (2.3 MB/s)
Switching copy chunk size to 33554432 bytes (32 MB)
Copying a chunk of 33554432 bytes from position 67076096 in /…/f1 to /…/tmp/f1 took 14161 ms (2.3 MB/s)
Switching copy chunk size to 16777216 bytes (16 MB)
Copying a chunk of 16777216 bytes from position 100630528 in /…/f1 to /…/tmp/f1 took 7555 ms (2.1 MB/s)
Copying a chunk of 16777216 bytes from position 117407744 in /…/f1 to /…/tmp/f1 took 7111 ms (2.3 MB/s)
Copying a chunk of 16777216 bytes from position 134184960 in /…/f1 to /…/tmp/f1 took 6595 ms (2.4 MB/s)
^C

The app ends up calling sendfile() in a loop with 64MB block (or smaller if it detects it is slow), which should be ok. However each sendfile() takes quite a long time to finish (when the issue was happening):
15212 15:58:25 open("/…/f1", O_RDONLY) = 4 <0.000038>
...
15212 15:58:25 open("/…/tmp/f1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6 <0.003453>
...
15212 15:58:25 sendfile(6, 4, [0], 67076096 <unfinished ...>
15212 15:58:53 <... sendfile resumed> ) = 67076096 <28.004107>
...
15212 15:58:53 sendfile(6, 4, [67076096], 33554432 <unfinished ...>
15212 15:59:07 <... sendfile resumed> ) = 33554432 <14.160558>
...
15212 15:59:07 sendfile(6, 4, [100630528], 16777216 <unfinished ...>
15212 15:59:14 <... sendfile resumed> ) = 16777216 <7.555443>
...
15212 15:59:14 sendfile(6, 4, [117407744], 16777216 <unfinished ...>
15212 15:59:22 <... sendfile resumed> ) = 16777216 <7.110757>
...
This is less than 3MB/s. The iostat output while the above test was running:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Filesystem:              rBlk_nor/s   wBlk_nor/s   rBlk_dir/s   wBlk_dir/s   rBlk_svr/s   wBlk_svr/s     ops/s    rops/s    wops/s
hostxxx:/mount/point           0.00      3968.00         0.00         0.00         0.00      3936.00    509.00      0.00    492.00
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Filesystem:              rBlk_nor/s   wBlk_nor/s   rBlk_dir/s   wBlk_dir/s   rBlk_svr/s   wBlk_svr/s     ops/s    rops/s    wops/s
hostxxx:/mount/point           0.00      3968.00         0.00         0.00         0.00      4040.00    581.00      0.00    505.00
There are no reads from local disks as the file is cached in memory.
No network congestion, no increased load on the nfs servers during the slowdowns, etc.
The nfsiostat output (while writes are slow):
# nfsiostat /mnt/…/ 1

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
 251.20            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 38.022         2419.958         63.647      320 (0.0%)           5.031           7.105
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                106.765         5446.704         51.016       40 (0.0%)           3.165         4799.198

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
 476.00            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                476.000         2076.922          4.363        0 (0.0%)           2.025           2.092

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
 494.00            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                494.000         2155.461          4.363        0 (0.0%)           1.996           2.026

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
 367.00            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                322.000         1404.977          4.363        0 (0.0%)           3.099           3.130

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
 453.00            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                453.000         1976.566          4.363        0 (0.0%)           2.179           2.212

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
 437.00            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                437.000         1906.754          4.363        0 (0.0%)           2.256           2.288
^C
Let's compare to a fast run:
hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
5178.00         33515.07
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                5032.000        323876.031       64.363        0 (0.0%)           2.908         13164.891

hostxxx:/mount/point mounted on /mnt/…/:

   op/s         rpc bklog
4830.00         28661.95
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                4672.000        300705.250       64.363        0 (0.0%)           3.191         14171.388

hostxxx:/mount/point mounted on /mnt/.../:

   op/s         rpc bklog
5030.00         23886.62
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                4875.000        313770.996       64.363        0 (0.0%)           3.028         15096.526
The RTT is about the same in slow and fast runs. The avg exe is completely different, but this seems to be due to large rpc backlog in fast case while no queuing in slow case. Also avg block size is just 4KB in slow case vs. 64KB in fast case. I managed to reproduce the issue with cp command which uses read()/write() syscalls.
Fast/good case:
64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000028>
64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000048>
64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000028>
64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000058>
64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000028>
64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000059>
64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000029>
64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000052>
slow/bad case:
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000060>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.021412>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000046>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.022443>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000050>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.032568>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000054>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.038464>           
Similar numbers to sendfile() case. While the issue is happening, let' s collect kernel stacks:
# perf script -i perf.data.s3
Failed to open /tmp/perf-7054.map, continuing without symbols
java  7055 [040] 43655512.790932: writeback:wbc_writepage: bdi 0:484: towrt=9223372036854775807 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 more=0 older=0x0 start=0x0 end=0x7fffffffffffffff
        ffffffff811c9046 ftrace_profile_wbc_writepage ([kernel.kallsyms])
        ffffffff811441da write_cache_pages ([kernel.kallsyms])
                   199d2 nfs_writepages ([nfs])
        ffffffff81144291 do_writepages ([kernel.kallsyms])
        ffffffff811c7a7d writeback_single_inode ([kernel.kallsyms])
        ffffffff811c7c63 sync_inode ([kernel.kallsyms])
                   18983 nfs_wb_all ([nfs])
                    8a8c nfs_write_end ([nfs])
        ffffffff8112e210 pagecache_write_end ([kernel.kallsyms])
        ffffffff811cb0a6 pipe_to_file ([kernel.kallsyms])
        ffffffff811c9ce2 splice_from_pipe_feed ([kernel.kallsyms])
        ffffffff811cadbe generic_file_splice_write_actor ([kernel.kallsyms])
        ffffffff811cae9c splice_write_to_file ([kernel.kallsyms])
        ffffffff811caf45 generic_file_splice_write ([kernel.kallsyms])
                    79c6 nfs_file_splice_write ([nfs])
        ffffffff811ca96a do_splice_from ([kernel.kallsyms])
        ffffffff811ca9c0 direct_splice_actor ([kernel.kallsyms])
        ffffffff811cabf6 splice_direct_to_actor ([kernel.kallsyms])
        ffffffff811cad3d do_splice_direct ([kernel.kallsyms])
        ffffffff8119abc4 do_sendfile ([kernel.kallsyms])
        ffffffff8119aca1 sys_sendfile64 ([kernel.kallsyms])
        ffffffff81556951 tracesys ([kernel.kallsyms])
                   df81a sendfile (/lib64/libc-2.12.so)
            7fb4bcbd07a7 [unknown] (/tmp/perf-7054.map)
            7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map)
            7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map)
            7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map)
            7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map)
            7fb4bcbb87a7 [unknown] (/tmp/perf-7054.map)
                  697a76 JavaCalls::call_helper (/.../pkgs/jdk1.8.0_181/jre/lib/amd64/server/libjvm.so)
                  6d91c2 jni_invoke_static (/.../pkgs/jdk1.8.0_181/jre/lib/amd64/server/libjvm.so)
                  6f5a2a jni_CallStaticVoidMethod (/.../pkgs/jdk1.8.0_181/jre/lib/amd64/server/libjvm.so)
                    80ff JavaMain (/.../pkgs/jdk1.8.0_181/lib/amd64/jli/libjli.so)
                    7aa1 start_thread (/lib64/libpthread-2.12.so)
What's interesting here is:
...
        ffffffff811c7c63 sync_inode ([kernel.kallsyms])
                   18983 nfs_wb_all ([nfs])
                    8a8c nfs_write_end ([nfs])
...
This is the corresponding source code in kernel:
./fs/nfs/file.c:
    417 static int nfs_write_end(struct file *file, struct address_space *mapping,
…
    458         if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
    459                 status = nfs_wb_all(mapping->host);
    460                 if (status < 0)
    461                         return status;
    462         }
In order to call nfs_wb_all() which is flushing all pages and making app issuing synchronous 4kb i/o (the performance issue) the nfs_ctx_key_to_expire() has to return non zero value. Let's check the code when it happens:
./fs/nfs/file.c:
    417 static int nfs_write_end(struct file *file, struct address_space *mapping,
…
    458         if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
    459                 status = nfs_wb_all(mapping->host);
    460                 if (status < 0)
    461                         return status;
    462         }

./fs/nfs/write.c:
bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode *inode)
{
        struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth;

        return rpcauth_cred_key_to_expire(auth, ctx->cred);
}

./net/sunrpc/auth.c:
bool
rpcauth_cred_key_to_expire(struct rpc_auth *auth, struct rpc_cred *cred)
{
        if (auth->au_flags & RPCAUTH_AUTH_NO_CRKEY_TIMEOUT)
                return false;
        if (!cred->cr_ops->crkey_to_expire)
                return false;
        return cred->cr_ops->crkey_to_expire(cred);
}

./net/sunrpc/auth_generic.c:    .crkey_to_expire = generic_key_to_expire,

./net/sunrpc/auth_generic.c:
static bool generic_key_to_expire(struct rpc_cred *cred)
{
        struct auth_cred *acred = &container_of(cred, struct generic_cred,
                                                gc_base)->acred;
        bool ret;

        get_rpccred(cred);
        ret = test_bit(RPC_CRED_KEY_EXPIRE_SOON, &acred->ac_flags);
        put_rpccred(cred);

        return ret;
}

static const struct rpc_credops generic_credops = {
        .cr_name = "Generic cred",
        .crdestroy = generic_destroy_cred,
        .crbind = generic_bind_cred,
        .crmatch = generic_match,
        .crkey_to_expire = generic_key_to_expire,
};

./net/sunrpc/auth_generic.c
    202 generic_key_timeout(struct rpc_auth *auth, struct rpc_cred *cred)
    224         ret = tcred->cr_ops->crkey_timeout(tcred);
    225         if (ret != 0) {
    226                 set_bit(RPC_CRED_KEY_EXPIRE_SOON, &acred->ac_flags);
    227                 ret = 0;
    228         } else {
    229                 /* In case underlying cred key has been reset */
    230                 if (test_and_clear_bit(RPC_CRED_KEY_EXPIRE_SOON,
    231                                         &acred->ac_flags))
    232                         dprless intk("RPC:        UID %d Credential key reset\n",
    233                                 tcred->cr_uid);
    234                 /* set up fasttrack for the normal case */
    235                 set_bit(RPC_CRED_NOTIFY_TIMEOUT, &acred->ac_flags);
    236         }

./net/sunrpc/auth_gss/auth_gss.c:       .crkey_timeout  = gss_key_timeout,

./net/sunrpc/auth_gss/auth_gss.c:


./net/sunrpc/auth_gss/auth_gss.c:
#define GSS_KEY_EXPIRE_TIMEO 240
static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;

static int
gss_key_timeout(struct rpc_cred *rc)
{
        struct gss_cred *gss_cred = container_of(rc, struct gss_cred, gc_base);
        unsigned long now = jiffies;
        unsigned long expire;

        if (gss_cred->gc_ctx == NULL)
                return -EACCES;

        expire = gss_cred->gc_ctx->gc_expiry - (gss_key_expire_timeo * HZ);

        if (time_after(now, expire))
                return -EACCES;
        return 0;
}
To hit the code path the RPC_CRED_KEY_EXPIRE_SOON flag has to be set. Let's use kprobes to trace the relevant kernel functions and see what they return:
# ./kprobe -d 1 'r:nfs_ctx_key_to_expire $retval'
Tracing kprobe nfs_ctx_key_to_expire for 1 seconds (buffered)...
           <...>-80656 [005] 21002.939170: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.940730: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.942151: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.943556: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.944987: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.946327: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.947668: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.949157: nfs_ctx_key_to_expire: (nfs_file_splice_write+0xfd/0x140 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.949171: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.950754: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.952075: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
           <...>-80656 [005] 21002.953633: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1
…


# ./kprobe -d 1 'r:rpcauth_cred_key_to_expire $retval'
           <...>-80656 [005] 21511.495978: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1
           <...>-80656 [005] 21511.497206: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1
           <...>-80656 [005] 21511.498429: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1
           <...>-80656 [005] 21511.498443: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1
…

# ./kprobe -d 1 'r:generic_key_to_expire $retval'
           <...>-80656 [005] 21663.712048: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1
           <...>-80656 [005] 21663.713661: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1
           <...>-80656 [005] 21663.715152: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1
           <...>-80656 [005] 21663.716583: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1
…
The generic_key_to_expire() function returns 1 while the issue is happening. All it does is it tests if RPC_CRED_KEY_EXPIRE_SOON bit is set:
static bool generic_key_to_expire(struct rpc_cred *cred)
{
        struct auth_cred *acred = &container_of(cred, struct generic_cred,
                                                gc_base)->acred;
        bool ret;

        get_rpccred(cred);
        ret = test_bit(RPC_CRED_KEY_EXPIRE_SOON, &acred->ac_flags);
        put_rpccred(cred);

        return ret;
}
Let's get kernel stack for the cp case:
cp 34297 [001] 25235.714806: writeback:wbc_writepage: bdi 0:1479: towrt=9223372036854775807 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 more=0 older=0x0 start=0x0 end=0x7fffffffffffffff
            7fff811ca046 ftrace_profile_wbc_writepage ([kernel.kallsyms])
            7fff811451da write_cache_pages ([kernel.kallsyms])
            7fffa041b962 nfs_writepages ([kernel.kallsyms])
            7fff81145291 do_writepages ([kernel.kallsyms])
            7fff811c8a7d writeback_single_inode ([kernel.kallsyms])
            7fff811c8c63 sync_inode ([kernel.kallsyms])
            7fffa041a913 nfs_wb_all ([kernel.kallsyms])
            7fffa040aa1c nfs_write_end ([kernel.kallsyms])
            7fff8113087a generic_file_buffered_write ([kernel.kallsyms])
            7fff81132270 __generic_file_aio_write ([kernel.kallsyms])
            7fff81132528 generic_file_aio_write ([kernel.kallsyms])
            7fffa040a3f4 nfs_file_write ([kernel.kallsyms])
            7fff8119b47a do_sync_write ([kernel.kallsyms])
            7fff8119b778 vfs_write ([kernel.kallsyms])
            7fff8119c2b1 sys_write ([kernel.kallsyms])
            7fff81557951 tracesys ([kernel.kallsyms])
                   db7d0 __write_nocancel (/lib64/libc-2.12.so)
This is interesting, as vfs_write() is calling do_sync_write(), the corresponding source code is:
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
…
                if (file->f_op->write)
                        ret = file->f_op->write(file, buf, count, pos);
                else
                        ret = do_sync_write(file, buf, count, pos);
In more recent kernels the code around the RPC_CRED_KEY_EXPIRE_SOON changed, for example:
commit ddf529eeed59184c49dcad1633c11831f822bf6b
Author: NeilBrown <neilb@suse.com>
Date:   Mon Dec 3 11:30:30 2018 +1100

    NFS: move credential expiry tracking out of SUNRPC into NFS.

    NFS needs to know when a credential is about to expire so that
    it can modify write-back behaviour to finish the write inside the
    expiry time.
    It currently uses functions in SUNRPC code which make use of a
    fairly complex callback scheme and flags in the generic credientials.

    As I am working to discard the generic credentials, this has to change.

    This patch moves the logic into NFS, in part by finding and caching
    the low-level credential in the open_context.  We then make direct
    cred-api calls on that.

    This makes the code much simpler and removes a dependency on generic
    rpc credentials.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There are other changes as well, and the issue might be fixed in more recent kernels. However I reproduced the issue on Linux kernel 5.5.0-rc2 - latest at the time of this debugging.
I wrote stap script which instruments nfs_ctx_key_to_expire() function and prints some basic information including if RPC_CRED_KEY_EXPIRE_SOON bit is set. It also instruments gss_key_timeout() function and prints current time, gc_expire and delta.
# cat nfs_ctx_key_to_expire.stp
#!/usr/bin/env stap


global RPC_CRED_KEY_EXPIRE_SOON = 2; // bit 1
global RPC_CRED_NOTIFY_TIMEOUT = 4;  // bit 2

global IN;
global AC_FLAGS;

@define AC_FLAGS_TO_STR(ac_flags) %(
    " " .
    (@ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "") .
    (@ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "") .
    " "
%)


probe module("nfs").function("nfs_ctx_key_to_expire") {
    IN[tid()] = 1;
}

probe module("nfs").function("nfs_ctx_key_to_expire").return {
    // probefunc() returns outside function in .return and pp() is too verbose, so let's hard-code string
    printf("%s %s()=%d proc: %s[%d] uid: %d ", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire", $return, execname(), pid(), uid());
    printf("ac_flags: %d [%s] ", AC_FLAGS[tid()], @AC_FLAGS_TO_STR(AC_FLAGS[tid()]));
    if ($# > 0 && @1 == "inode_name") {
        printf("inode_name: %s", inode_name(@entry($inode)));
    }
    printf("\n");

    delete AC_FLAGS[tid()];
    delete IN[tid()];
}


probe module("sunrpc").function("generic_key_to_expire").return {
    if (IN[tid()]) {
        acred = & @container_of(@entry($cred), "struct generic_cred", gc_base)->acred;
        AC_FLAGS[tid()] = acred->ac_flags;

/*
        flags_s = " ";
        flags_s = flags_s . (acred->ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "")
        flags_s = flags_s . (acred->ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "")
        printf("ac_flags: %d [%s ]\n", acred->ac_flags, flags_s);
*/
    }
}


probe module("auth_rpcgss").function("gss_key_timeout").return {
    printf("%s %s()=%d proc: %s[%d] uid: %d ", ctime(gettimeofday_s()), "gss_key_timeout", $return, execname(), pid(), uid());
    gss_cred = & @container_of(@entry($rc), "struct gss_cred", gc_base);
    printf("now: %d ", jiffies()); // @entry($now) is broken
    printf("gc_expire: %d %d ", gss_cred->gc_ctx->gc_expiry, gss_cred->gc_ctx->gc_expiry - 240 * HZ());
    printf("delta: %d %d", (gss_cred->gc_ctx->gc_expiry - jiffies())/HZ(), (gss_cred->gc_ctx->gc_expiry - 240 * HZ() - jiffies())/HZ());
    printf("\n");

    print_backtrace();
}
Then I modified key timeout used by rpc-gssd to 5 minutes for easier/quicker testing:
# vi /etc/sysconfig/nfs
#RPCGSSDARGS="-t 1800"
RPCGSSDARGS="-t 300"

# systemctl restart rpc-gssd
This makes ticket valid for 5 minutes in the kernel. Now in another terminal I run:
$ while [ 1 ]; do /bin/echo aa >nethome/test ; rm nethome/test; sleep 5; done
While also running:
# ./nfs_ctx_key_to_expire.stp -o /var/tmp/aa xx &

# tail -f /var/tmp/aa
...
Thu Aug  1 14:43:33 2019 nfs_ctx_key_to_expire()=0 proc: echo[31151] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:43:33 2019 nfs_ctx_key_to_expire()=0 proc: echo[31151] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:43:38 2019 nfs_ctx_key_to_expire()=0 proc: echo[31158] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:43:38 2019 nfs_ctx_key_to_expire()=0 proc: echo[31158] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:43:43 2019 nfs_ctx_key_to_expire()=0 proc: echo[31165] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:43:43 2019 nfs_ctx_key_to_expire()=0 proc: echo[31165] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:43:48 2019 gss_key_timeout()=-13 proc: echo[31172] uid: 14100 now: 7509867697 gc_expire: 7510107517 7509867517 delta: 239 0
Returning from:  0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss]
Returning to  :  0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc]
 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc]
 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs]
 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs]
 0xffffffff8cc1e763
 0xffffffff8cc1f240
 0xffffffff8cc2006f
 0xffffffff8d12579b
Thu Aug  1 14:43:48 2019 nfs_ctx_key_to_expire()=1 proc: echo[31172] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:43:48 2019 nfs_ctx_key_to_expire()=1 proc: echo[31172] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:43:53 2019 gss_key_timeout()=-13 proc: echo[31179] uid: 14100 now: 7509872712 gc_expire: 7510107517 7509867517 delta: 234 -5
Returning from:  0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss]
Returning to  :  0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc]
 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc]
 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs]
 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs]
 0xffffffff8cc1e763
 0xffffffff8cc1f240
 0xffffffff8cc2006f
 0xffffffff8d12579b
...
Thu Aug  1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:39 2019 gss_key_timeout()=-13 proc: echo[32018] uid: 14100 now: 7510098709 gc_expire: 7510107517 7509867517 delta: 8 -231
Returning from:  0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss]
Returning to  :  0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc]
 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc]
 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs]
 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs]
 0xffffffff8cc1e763
 0xffffffff8cc1f240
 0xffffffff8cc2006f
 0xffffffff8d125a1b
Thu Aug  1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:44 2019 gss_key_timeout()=-13 proc: echo[32027] uid: 14100 now: 7510103729 gc_expire: 7510107517 7509867517 delta: 3 -236
Returning from:  0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss]
Returning to  :  0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc]
 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc]
 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs]
 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs]
 0xffffffff8cc1e763
 0xffffffff8cc1f240
 0xffffffff8cc2006f
 0xffffffff8d125a1b
Thu Aug  1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:49 2019 gss_key_timeout()=0 proc: echo[32037] uid: 14100 now: 7510108758 gc_expire: 7510408751 7510168751 delta: 299 59
Returning from:  0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss]
Returning to  :  0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc]
 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc]
 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs]
 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs]
 0xffffffff8cc1e763
 0xffffffff8cc1f240
 0xffffffff8cc2006f
 0xffffffff8d125a1b
Thu Aug  1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:48:00 2019 nfs_ctx_key_to_expire()=0 proc: echo[32069] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:48:00 2019 nfs_ctx_key_to_expire()=0 proc: echo[32069] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
But once we get to 5minutes - 240s window the RPC_CRED_KEY_EXPIRE_SOON bit is set which makes all writes synchronous (and slow). Once it expires, it gets renewed and then it is fine for another 1 minute (5m-4m). On some servers we had “-t 1800”, which resulted in 4 minutes of slowness every 30 minutes.
Just to double check I also measured time it takes to complete each write:
$ while [ 1 ]; do strace -tT -v -etrace=write /bin/echo aa >nethome/test ; rm nethome/test; sleep 5; done
15:47:19 write(1, "aa\n", 3)            = 3 <0.001536>
15:47:19 +++ exited with 0 +++
15:47:24 write(1, "aa\n", 3)            = 3 <0.001501>
15:47:24 +++ exited with 0 +++
15:47:29 write(1, "aa\n", 3)            = 3 <0.001427>
15:47:29 +++ exited with 0 +++
15:47:34 write(1, "aa\n", 3)            = 3 <0.001455>
15:47:34 +++ exited with 0 +++
15:47:39 write(1, "aa\n", 3)            = 3 <0.001503>
15:47:39 +++ exited with 0 +++
15:47:44 write(1, "aa\n", 3)            = 3 <0.001542>
15:47:44 +++ exited with 0 +++
15:47:49 write(1, "aa\n", 3)            = 3 <0.000132>
15:47:49 +++ exited with 0 +++
15:47:54 write(1, "aa\n", 3)            = 3 <0.000069>
15:47:54 +++ exited with 0 +++
15:48:00 write(1, "aa\n", 3)            = 3 <0.000338>
15:48:00 +++ exited with 0 +++
^C
Notice that initially each write() is slow (over 1ms) and then it becomes fast which matches to when the bit is cleared:
# grep 'Thu Aug  1 14:47' /var/tmp/aa
Thu Aug  1 14:47:01 2019 gss_key_timeout()=-13 proc: echo[31942] uid: 14100 now: 7510060000 gc_expire: 7510107517 7509867517 delta: 47 -192
Thu Aug  1 14:47:01 2019 nfs_ctx_key_to_expire()=1 proc: echo[31942] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:01 2019 nfs_ctx_key_to_expire()=1 proc: echo[31942] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:06 2019 gss_key_timeout()=-13 proc: echo[31951] uid: 14100 now: 7510065018 gc_expire: 7510107517 7509867517 delta: 42 -197
Thu Aug  1 14:47:06 2019 nfs_ctx_key_to_expire()=1 proc: echo[31951] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:06 2019 nfs_ctx_key_to_expire()=1 proc: echo[31951] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:11 2019 gss_key_timeout()=-13 proc: echo[31960] uid: 14100 now: 7510070037 gc_expire: 7510107517 7509867517 delta: 37 -202
Thu Aug  1 14:47:11 2019 nfs_ctx_key_to_expire()=1 proc: echo[31960] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:11 2019 nfs_ctx_key_to_expire()=1 proc: echo[31960] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:16 2019 gss_key_timeout()=-13 proc: echo[31972] uid: 14100 now: 7510075056 gc_expire: 7510107517 7509867517 delta: 32 -207
Thu Aug  1 14:47:16 2019 nfs_ctx_key_to_expire()=1 proc: echo[31972] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:16 2019 nfs_ctx_key_to_expire()=1 proc: echo[31972] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:19 2019 gss_key_timeout()=-13 proc: echo[31981] uid: 14100 now: 7510078637 gc_expire: 7510107517 7509867517 delta: 28 -211
Thu Aug  1 14:47:19 2019 nfs_ctx_key_to_expire()=1 proc: echo[31981] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:19 2019 nfs_ctx_key_to_expire()=1 proc: echo[31981] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:24 2019 gss_key_timeout()=-13 proc: echo[31991] uid: 14100 now: 7510083655 gc_expire: 7510107517 7509867517 delta: 23 -216
Thu Aug  1 14:47:24 2019 nfs_ctx_key_to_expire()=1 proc: echo[31991] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:24 2019 nfs_ctx_key_to_expire()=1 proc: echo[31991] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:29 2019 gss_key_timeout()=-13 proc: echo[32000] uid: 14100 now: 7510088673 gc_expire: 7510107517 7509867517 delta: 18 -221
Thu Aug  1 14:47:29 2019 nfs_ctx_key_to_expire()=1 proc: echo[32000] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:29 2019 nfs_ctx_key_to_expire()=1 proc: echo[32000] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:34 2019 gss_key_timeout()=-13 proc: echo[32009] uid: 14100 now: 7510093690 gc_expire: 7510107517 7509867517 delta: 13 -226
Thu Aug  1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:39 2019 gss_key_timeout()=-13 proc: echo[32018] uid: 14100 now: 7510098709 gc_expire: 7510107517 7509867517 delta: 8 -231
Thu Aug  1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:44 2019 gss_key_timeout()=-13 proc: echo[32027] uid: 14100 now: 7510103729 gc_expire: 7510107517 7509867517 delta: 3 -236
Thu Aug  1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ]
Thu Aug  1 14:47:49 2019 gss_key_timeout()=0 proc: echo[32037] uid: 14100 now: 7510108758 gc_expire: 7510408751 7510168751 delta: 299 59
Thu Aug  1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Thu Aug  1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
I couldn’t find anything else in the source code which would end up calling the above functions or call_refresh() – like an extra thread which periodically wakes up and refreshes all keys before they end up with the RPC_CRED_KEY_EXPIRE_SOON set. Looks like the current behaviour is, that once a cred/key gets into 240s period before it is set to expire, all writes are no longer buffereed and are being written wynchronously on a page-by-page (4KB) basis, additionally all dirt pages are being flushed. Once the cred expires it gets refreshed and if successful all will switch back to normal. Until the next 240s window before expiry.

I tried a quick “fix” which works ok on 5.5.0-rc2:
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index d75fddc..6d36ae6 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1478,6 +1478,8 @@ static void gss_pipe_free(struct gss_pipe *p)
        ctx = rcu_dereference(gss_cred->gc_ctx);
        if (!ctx || time_after(timeout, ctx->gc_expiry))
                ret = -EACCES;
+       if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry))
+               clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
        rcu_read_unlock();

        return ret;
This is NOT a final solution and is wrong in a general case.

I also wrote a systemtap script to modify a kernel on-the fly which essentially does the same as the patch in previous update (again, this is not a correct solution and is potentially risky).
# cat nfs_ctx_key_to_expire_workaround3.stp
#!/usr/bin/env stap


global RPC_CRED_KEY_EXPIRE_SOON = 2; // bit 1
global RPC_CRED_NOTIFY_TIMEOUT = 4;  // bit 2

global IN;
global AC_FLAGS;

global LOG = 0;

@define AC_FLAGS_TO_STR(ac_flags) %(
    " " .
    (@ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "") .
    (@ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "") .
    " "
%)

probe begin {
    if ($# > 0 && @1 == "log")
        LOG = 1;
}

probe module("nfs").function("nfs_ctx_key_to_expire") {
    IN[tid()] = 1;
}


probe module("nfs").function("nfs_ctx_key_to_expire").return {
    if ($return == 0) {
        delete IN[tid()];
        delete AC_FLAGS[tid()];
        next;
    }

    if (LOG) {
        // probefunc() returns outside function in .return and pp() is too verbose, so let's hard-code string
        printf("%s %s()=%d proc: %s[%d] uid: %d ", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire", $return, execname(), pid(), uid());
        printf("ac_flags: %d [%s]\n", AC_FLAGS[tid()], @AC_FLAGS_TO_STR(AC_FLAGS[tid()]));
    }

    //print_backtrace();
    //printf("\n");

%( guru_mode == 1 %?
    if (LOG)
        printf("%s %s PATCHING return code to 0\n", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire");
    $return = 0;
%:
    printf("%s %s NON-GURU MODE, would have patched rc=0\n", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire");
%)

    delete AC_FLAGS[tid()];                                                                                                                                                                                                          [0/1861]
    delete IN[tid()];
}


probe module("sunrpc").function("generic_key_to_expire").return {
    if (IN[tid()]) {
        acred = & @container_of(@entry($cred), "struct generic_cred", gc_base)->acred;
        AC_FLAGS[tid()] = acred->ac_flags;

/*
        flags_s = " ";
        flags_s = flags_s . (acred->ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "")
        flags_s = flags_s . (acred->ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "")
        printf("ac_flags: %d [%s ]\n", acred->ac_flags, flags_s);
*/
    }
}


probe module("auth_rpcgss").function("gss_key_timeout") {
        gss_cred = & @container_of($rc, "struct gss_cred", gc_base);
        if (gss_cred == 0 || gss_cred->gc_ctx == 0) {
            if (LOG)
                printf("%s %s NOT PATCHING %s[%d], cr_uid: %d, cr_flags: %d, gss_cred: %x, gss_cred->gc_ctx: %x\n",
                        ctime(gettimeofday_s()),
                        "gss_key_timeout",
                        execname(), pid(),
                        $rc->cr_uid,
                        $rc->cr_flags,
                        gss_cred,
                        gss_cred ? gss_cred->gc_ctx : 0);
            next;
        }

        dt = (gss_cred->gc_ctx->gc_expiry - jiffies())/HZ();

%( guru_mode == 1 %?
        // define RPCAUTH_CRED_UPTODATE 1
        // this is a bit number, so it is 2 in decimal
        if (dt < 300 && $rc->cr_flags & 2) {
        //  new_flags = $rc->cr_flags - 2;
            new_flags = $rc->cr_flags & ~(1 << 1);
            if (LOG)
                printf("%s %s patching %s[%d], cr_uid: %d, dt: %d cr_flags: %d new cr_flags: %d\n",
                        ctime(gettimeofday_s()),
                        "gss_key_timeout",
                        execname(), pid(),
                        $rc->cr_uid,
                        dt,
                        $rc->cr_flags,
                        new_flags);
            //$rc->cr_flags = new_flags;
            $rc->cr_flags = $rc->cr_flags & ~(1 << 1);
            //gss_cred->gc_ctx->gc_expiry = 360 * HZ();
        }
%:
        // just print some info in non-guru mode
        if (!(dt < 300 && $rc->cr_flags & 2))
            next;

        printf("%s %s[%d], cr_uid: %d dt: %d cr_flags: %d\n",
                ctime(gettimeofday_s()),
                execname(), pid(),
                $rc->cr_uid,
                dt,
                $rc->cr_flags);
%)
}
The script contains some left-overs from initial debugging and some of it is not necessary for the workaround to work.
To run it:
# ./nfs_ctx_key_to_expire_workaround3.stp -g --suppress-time-limits nolog &

Notice, that running the workaround means any dirty pages are not destaged in the last 240s before key expiry and all writes are still buffered - if for any reason the nfs client won't manage to refresh the key then all the dirty pages might never destage.

The current behaviour is rather an lazy approach to fix the issue - I believe a much better solution would be to try to refresh they key a minut earlier - if it is unsuccesful then keep using the previous key and fallback to current behaviour, otherwise reset the timer and keep using new key. This would mean that in a vast majority of cases the 4 minutes of slowness would never happen.

Unfortunately I haven’t got time to work on a proper fix (and for different reasons the issue became less noticeable for us, so it ended up being a low priority).

btw: the kprobe script was obtained from https://github.com/brendangregg/perf-tools

2 comments:

Anonymous said...

nice work

Paul Fox said...

Great writeup Rob