It’s a blog entry I started writing over 3 years ago and never managed to actually finish/publish it, until now.
I haven’t checked in a while if latest Linux kernel is still affected by the described behaviour here, but I suspect it is.
An application running on CentOS 6.9 - it often copies files (4-6GB in size on average) from a local SSD based cache to NFSv4 (+krb) mount. Developers complained that twice a day the copy takes significantly longer to complete without any obvious reason. They even added logging the their code to log all occurrences when a copy of a 64MB chunk takes longer than 10,000ms.
After some quick testing I determined that reads are fast, but writes sometimes would be very slow, despite them being buffered writes on a server with lots of free memory. No obvious bottlenecks on network, no packet drops, etc.
Initially I suspected write back throttling, but quickly realised this was not the culprit.
After tracing the java code I determined that the application ends up calling sendfile() with 64MB chunks. Nothing unusual here, but just in case there is some sendfile() specific behaviour here, I tried to reproduce the issue with read()/write() loop with 64MB chunks and hit the same issue. The issue also occurred with smaller chunk sizes.
A test version of the app (with only the relevant/affected file copy code):
$ strace -tT -v -f -o /tmp/a java CopyTest /.../f1 /mnt/.../tmp/f1 Copying a chunk of 67076096 bytes from position 0 in /…/f1 to /…/tmp/f1 took 28005 ms (2.3 MB/s) Switching copy chunk size to 33554432 bytes (32 MB) Copying a chunk of 33554432 bytes from position 67076096 in /…/f1 to /…/tmp/f1 took 14161 ms (2.3 MB/s) Switching copy chunk size to 16777216 bytes (16 MB) Copying a chunk of 16777216 bytes from position 100630528 in /…/f1 to /…/tmp/f1 took 7555 ms (2.1 MB/s) Copying a chunk of 16777216 bytes from position 117407744 in /…/f1 to /…/tmp/f1 took 7111 ms (2.3 MB/s) Copying a chunk of 16777216 bytes from position 134184960 in /…/f1 to /…/tmp/f1 took 6595 ms (2.4 MB/s) ^C
The app ends up calling sendfile() in a loop with 64MB block (or smaller if it detects it is slow), which should be ok. However each sendfile() takes quite a long time to finish (when the issue was happening):
15212 15:58:25 open("/…/f1", O_RDONLY) = 4 <0.000038> ... 15212 15:58:25 open("/…/tmp/f1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6 <0.003453> ... 15212 15:58:25 sendfile(6, 4, [0], 67076096 <unfinished ...> 15212 15:58:53 <... sendfile resumed> ) = 67076096 <28.004107> ... 15212 15:58:53 sendfile(6, 4, [67076096], 33554432 <unfinished ...> 15212 15:59:07 <... sendfile resumed> ) = 33554432 <14.160558> ... 15212 15:59:07 sendfile(6, 4, [100630528], 16777216 <unfinished ...> 15212 15:59:14 <... sendfile resumed> ) = 16777216 <7.555443> ... 15212 15:59:14 sendfile(6, 4, [117407744], 16777216 <unfinished ...> 15212 15:59:22 <... sendfile resumed> ) = 16777216 <7.110757> ...
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s hostxxx:/mount/point 0.00 3968.00 0.00 0.00 0.00 3936.00 509.00 0.00 492.00 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s hostxxx:/mount/point 0.00 3968.00 0.00 0.00 0.00 4040.00 581.00 0.00 505.00
No network congestion, no increased load on the nfs servers during the slowdowns, etc.
The nfsiostat output (while writes are slow):
# nfsiostat /mnt/…/ 1 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 251.20 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 38.022 2419.958 63.647 320 (0.0%) 5.031 7.105 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 106.765 5446.704 51.016 40 (0.0%) 3.165 4799.198 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 476.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 476.000 2076.922 4.363 0 (0.0%) 2.025 2.092 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 494.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 494.000 2155.461 4.363 0 (0.0%) 1.996 2.026 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 367.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 322.000 1404.977 4.363 0 (0.0%) 3.099 3.130 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 453.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 453.000 1976.566 4.363 0 (0.0%) 2.179 2.212 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 437.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 437.000 1906.754 4.363 0 (0.0%) 2.256 2.288 ^C
hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 5178.00 33515.07 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 5032.000 323876.031 64.363 0 (0.0%) 2.908 13164.891 hostxxx:/mount/point mounted on /mnt/…/: op/s rpc bklog 4830.00 28661.95 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 4672.000 300705.250 64.363 0 (0.0%) 3.191 14171.388 hostxxx:/mount/point mounted on /mnt/.../: op/s rpc bklog 5030.00 23886.62 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 4875.000 313770.996 64.363 0 (0.0%) 3.028 15096.526
Fast/good case:
64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000028> 64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000048> 64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000028> 64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000058> 64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000028> 64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000059> 64477 17:31:46 read(3, "..."..., 65536) = 65536 <0.000029> 64477 17:31:46 write(4, "..."..., 65536) = 65536 <0.000052>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000060>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.021412>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000046>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.022443>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000050>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.032568>
62376 17:27:02 read(3, "..."..., 65536) = 65536 <0.000054>
62376 17:27:02 write(4, "..."..., 65536) = 65536 <0.038464>
# perf script -i perf.data.s3 Failed to open /tmp/perf-7054.map, continuing without symbols java 7055 [040] 43655512.790932: writeback:wbc_writepage: bdi 0:484: towrt=9223372036854775807 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 more=0 older=0x0 start=0x0 end=0x7fffffffffffffff ffffffff811c9046 ftrace_profile_wbc_writepage ([kernel.kallsyms]) ffffffff811441da write_cache_pages ([kernel.kallsyms]) 199d2 nfs_writepages ([nfs]) ffffffff81144291 do_writepages ([kernel.kallsyms]) ffffffff811c7a7d writeback_single_inode ([kernel.kallsyms]) ffffffff811c7c63 sync_inode ([kernel.kallsyms]) 18983 nfs_wb_all ([nfs]) 8a8c nfs_write_end ([nfs]) ffffffff8112e210 pagecache_write_end ([kernel.kallsyms]) ffffffff811cb0a6 pipe_to_file ([kernel.kallsyms]) ffffffff811c9ce2 splice_from_pipe_feed ([kernel.kallsyms]) ffffffff811cadbe generic_file_splice_write_actor ([kernel.kallsyms]) ffffffff811cae9c splice_write_to_file ([kernel.kallsyms]) ffffffff811caf45 generic_file_splice_write ([kernel.kallsyms]) 79c6 nfs_file_splice_write ([nfs]) ffffffff811ca96a do_splice_from ([kernel.kallsyms]) ffffffff811ca9c0 direct_splice_actor ([kernel.kallsyms]) ffffffff811cabf6 splice_direct_to_actor ([kernel.kallsyms]) ffffffff811cad3d do_splice_direct ([kernel.kallsyms]) ffffffff8119abc4 do_sendfile ([kernel.kallsyms]) ffffffff8119aca1 sys_sendfile64 ([kernel.kallsyms]) ffffffff81556951 tracesys ([kernel.kallsyms]) df81a sendfile (/lib64/libc-2.12.so) 7fb4bcbd07a7 [unknown] (/tmp/perf-7054.map) 7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map) 7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map) 7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map) 7fb4bcbbfe50 [unknown] (/tmp/perf-7054.map) 7fb4bcbb87a7 [unknown] (/tmp/perf-7054.map) 697a76 JavaCalls::call_helper (/.../pkgs/jdk1.8.0_181/jre/lib/amd64/server/libjvm.so) 6d91c2 jni_invoke_static (/.../pkgs/jdk1.8.0_181/jre/lib/amd64/server/libjvm.so) 6f5a2a jni_CallStaticVoidMethod (/.../pkgs/jdk1.8.0_181/jre/lib/amd64/server/libjvm.so) 80ff JavaMain (/.../pkgs/jdk1.8.0_181/lib/amd64/jli/libjli.so) 7aa1 start_thread (/lib64/libpthread-2.12.so)
... ffffffff811c7c63 sync_inode ([kernel.kallsyms]) 18983 nfs_wb_all ([nfs]) 8a8c nfs_write_end ([nfs]) ...
./fs/nfs/file.c: 417 static int nfs_write_end(struct file *file, struct address_space *mapping, … 458 if (nfs_ctx_key_to_expire(ctx, mapping->host)) { 459 status = nfs_wb_all(mapping->host); 460 if (status < 0) 461 return status; 462 }
./fs/nfs/file.c: 417 static int nfs_write_end(struct file *file, struct address_space *mapping, … 458 if (nfs_ctx_key_to_expire(ctx, mapping->host)) { 459 status = nfs_wb_all(mapping->host); 460 if (status < 0) 461 return status; 462 } ./fs/nfs/write.c: bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode *inode) { struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth; return rpcauth_cred_key_to_expire(auth, ctx->cred); } ./net/sunrpc/auth.c: bool rpcauth_cred_key_to_expire(struct rpc_auth *auth, struct rpc_cred *cred) { if (auth->au_flags & RPCAUTH_AUTH_NO_CRKEY_TIMEOUT) return false; if (!cred->cr_ops->crkey_to_expire) return false; return cred->cr_ops->crkey_to_expire(cred); } ./net/sunrpc/auth_generic.c: .crkey_to_expire = generic_key_to_expire, ./net/sunrpc/auth_generic.c: static bool generic_key_to_expire(struct rpc_cred *cred) { struct auth_cred *acred = &container_of(cred, struct generic_cred, gc_base)->acred; bool ret; get_rpccred(cred); ret = test_bit(RPC_CRED_KEY_EXPIRE_SOON, &acred->ac_flags); put_rpccred(cred); return ret; } static const struct rpc_credops generic_credops = { .cr_name = "Generic cred", .crdestroy = generic_destroy_cred, .crbind = generic_bind_cred, .crmatch = generic_match, .crkey_to_expire = generic_key_to_expire, }; ./net/sunrpc/auth_generic.c 202 generic_key_timeout(struct rpc_auth *auth, struct rpc_cred *cred) 224 ret = tcred->cr_ops->crkey_timeout(tcred); 225 if (ret != 0) { 226 set_bit(RPC_CRED_KEY_EXPIRE_SOON, &acred->ac_flags); 227 ret = 0; 228 } else { 229 /* In case underlying cred key has been reset */ 230 if (test_and_clear_bit(RPC_CRED_KEY_EXPIRE_SOON, 231 &acred->ac_flags)) 232 dprless intk("RPC: UID %d Credential key reset\n", 233 tcred->cr_uid); 234 /* set up fasttrack for the normal case */ 235 set_bit(RPC_CRED_NOTIFY_TIMEOUT, &acred->ac_flags); 236 } ./net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout = gss_key_timeout, ./net/sunrpc/auth_gss/auth_gss.c: ./net/sunrpc/auth_gss/auth_gss.c: #define GSS_KEY_EXPIRE_TIMEO 240 static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO; static int gss_key_timeout(struct rpc_cred *rc) { struct gss_cred *gss_cred = container_of(rc, struct gss_cred, gc_base); unsigned long now = jiffies; unsigned long expire; if (gss_cred->gc_ctx == NULL) return -EACCES; expire = gss_cred->gc_ctx->gc_expiry - (gss_key_expire_timeo * HZ); if (time_after(now, expire)) return -EACCES; return 0; }
# ./kprobe -d 1 'r:nfs_ctx_key_to_expire $retval' Tracing kprobe nfs_ctx_key_to_expire for 1 seconds (buffered)... <...>-80656 [005] 21002.939170: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.940730: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.942151: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.943556: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.944987: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.946327: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.947668: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.949157: nfs_ctx_key_to_expire: (nfs_file_splice_write+0xfd/0x140 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.949171: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.950754: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.952075: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 <...>-80656 [005] 21002.953633: nfs_ctx_key_to_expire: (nfs_write_end+0x18f/0x300 [nfs] <- nfs_ctx_key_to_expire) arg1=1 … # ./kprobe -d 1 'r:rpcauth_cred_key_to_expire $retval' <...>-80656 [005] 21511.495978: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1 <...>-80656 [005] 21511.497206: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1 <...>-80656 [005] 21511.498429: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1 <...>-80656 [005] 21511.498443: rpcauth_cred_key_to_expire: (nfs_ctx_key_to_expire+0x2b/0x30 [nfs] <- rpcauth_cred_key_to_expire) arg1=1 … # ./kprobe -d 1 'r:generic_key_to_expire $retval' <...>-80656 [005] 21663.712048: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1 <...>-80656 [005] 21663.713661: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1 <...>-80656 [005] 21663.715152: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1 <...>-80656 [005] 21663.716583: generic_key_to_expire: (rpcauth_cred_key_to_expire+0x21/0x30 [sunrpc] <- generic_key_to_expire) arg1=1 …
static bool generic_key_to_expire(struct rpc_cred *cred) { struct auth_cred *acred = &container_of(cred, struct generic_cred, gc_base)->acred; bool ret; get_rpccred(cred); ret = test_bit(RPC_CRED_KEY_EXPIRE_SOON, &acred->ac_flags); put_rpccred(cred); return ret; }
cp 34297 [001] 25235.714806: writeback:wbc_writepage: bdi 0:1479: towrt=9223372036854775807 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 more=0 older=0x0 start=0x0 end=0x7fffffffffffffff 7fff811ca046 ftrace_profile_wbc_writepage ([kernel.kallsyms]) 7fff811451da write_cache_pages ([kernel.kallsyms]) 7fffa041b962 nfs_writepages ([kernel.kallsyms]) 7fff81145291 do_writepages ([kernel.kallsyms]) 7fff811c8a7d writeback_single_inode ([kernel.kallsyms]) 7fff811c8c63 sync_inode ([kernel.kallsyms]) 7fffa041a913 nfs_wb_all ([kernel.kallsyms]) 7fffa040aa1c nfs_write_end ([kernel.kallsyms]) 7fff8113087a generic_file_buffered_write ([kernel.kallsyms]) 7fff81132270 __generic_file_aio_write ([kernel.kallsyms]) 7fff81132528 generic_file_aio_write ([kernel.kallsyms]) 7fffa040a3f4 nfs_file_write ([kernel.kallsyms]) 7fff8119b47a do_sync_write ([kernel.kallsyms]) 7fff8119b778 vfs_write ([kernel.kallsyms]) 7fff8119c2b1 sys_write ([kernel.kallsyms]) 7fff81557951 tracesys ([kernel.kallsyms]) db7d0 __write_nocancel (/lib64/libc-2.12.so)
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) … if (file->f_op->write) ret = file->f_op->write(file, buf, count, pos); else ret = do_sync_write(file, buf, count, pos);
commit ddf529eeed59184c49dcad1633c11831f822bf6b Author: NeilBrown <neilb@suse.com> Date: Mon Dec 3 11:30:30 2018 +1100 NFS: move credential expiry tracking out of SUNRPC into NFS. NFS needs to know when a credential is about to expire so that it can modify write-back behaviour to finish the write inside the expiry time. It currently uses functions in SUNRPC code which make use of a fairly complex callback scheme and flags in the generic credientials. As I am working to discard the generic credentials, this has to change. This patch moves the logic into NFS, in part by finding and caching the low-level credential in the open_context. We then make direct cred-api calls on that. This makes the code much simpler and removes a dependency on generic rpc credentials. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I wrote stap script which instruments nfs_ctx_key_to_expire() function and prints some basic information including if RPC_CRED_KEY_EXPIRE_SOON bit is set. It also instruments gss_key_timeout() function and prints current time, gc_expire and delta.
# cat nfs_ctx_key_to_expire.stp #!/usr/bin/env stap global RPC_CRED_KEY_EXPIRE_SOON = 2; // bit 1 global RPC_CRED_NOTIFY_TIMEOUT = 4; // bit 2 global IN; global AC_FLAGS; @define AC_FLAGS_TO_STR(ac_flags) %( " " . (@ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "") . (@ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "") . " " %) probe module("nfs").function("nfs_ctx_key_to_expire") { IN[tid()] = 1; } probe module("nfs").function("nfs_ctx_key_to_expire").return { // probefunc() returns outside function in .return and pp() is too verbose, so let's hard-code string printf("%s %s()=%d proc: %s[%d] uid: %d ", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire", $return, execname(), pid(), uid()); printf("ac_flags: %d [%s] ", AC_FLAGS[tid()], @AC_FLAGS_TO_STR(AC_FLAGS[tid()])); if ($# > 0 && @1 == "inode_name") { printf("inode_name: %s", inode_name(@entry($inode))); } printf("\n"); delete AC_FLAGS[tid()]; delete IN[tid()]; } probe module("sunrpc").function("generic_key_to_expire").return { if (IN[tid()]) { acred = & @container_of(@entry($cred), "struct generic_cred", gc_base)->acred; AC_FLAGS[tid()] = acred->ac_flags; /* flags_s = " "; flags_s = flags_s . (acred->ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "") flags_s = flags_s . (acred->ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "") printf("ac_flags: %d [%s ]\n", acred->ac_flags, flags_s); */ } } probe module("auth_rpcgss").function("gss_key_timeout").return { printf("%s %s()=%d proc: %s[%d] uid: %d ", ctime(gettimeofday_s()), "gss_key_timeout", $return, execname(), pid(), uid()); gss_cred = & @container_of(@entry($rc), "struct gss_cred", gc_base); printf("now: %d ", jiffies()); // @entry($now) is broken printf("gc_expire: %d %d ", gss_cred->gc_ctx->gc_expiry, gss_cred->gc_ctx->gc_expiry - 240 * HZ()); printf("delta: %d %d", (gss_cred->gc_ctx->gc_expiry - jiffies())/HZ(), (gss_cred->gc_ctx->gc_expiry - 240 * HZ() - jiffies())/HZ()); printf("\n"); print_backtrace(); }
# vi /etc/sysconfig/nfs #RPCGSSDARGS="-t 1800" RPCGSSDARGS="-t 300" # systemctl restart rpc-gssd
$ while [ 1 ]; do /bin/echo aa >nethome/test ; rm nethome/test; sleep 5; done
# ./nfs_ctx_key_to_expire.stp -o /var/tmp/aa xx & # tail -f /var/tmp/aa ... Thu Aug 1 14:43:33 2019 nfs_ctx_key_to_expire()=0 proc: echo[31151] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:43:33 2019 nfs_ctx_key_to_expire()=0 proc: echo[31151] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:43:38 2019 nfs_ctx_key_to_expire()=0 proc: echo[31158] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:43:38 2019 nfs_ctx_key_to_expire()=0 proc: echo[31158] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:43:43 2019 nfs_ctx_key_to_expire()=0 proc: echo[31165] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:43:43 2019 nfs_ctx_key_to_expire()=0 proc: echo[31165] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:43:48 2019 gss_key_timeout()=-13 proc: echo[31172] uid: 14100 now: 7509867697 gc_expire: 7510107517 7509867517 delta: 239 0 Returning from: 0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss] Returning to : 0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc] 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc] 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs] 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs] 0xffffffff8cc1e763 0xffffffff8cc1f240 0xffffffff8cc2006f 0xffffffff8d12579b Thu Aug 1 14:43:48 2019 nfs_ctx_key_to_expire()=1 proc: echo[31172] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:43:48 2019 nfs_ctx_key_to_expire()=1 proc: echo[31172] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:43:53 2019 gss_key_timeout()=-13 proc: echo[31179] uid: 14100 now: 7509872712 gc_expire: 7510107517 7509867517 delta: 234 -5 Returning from: 0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss] Returning to : 0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc] 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc] 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs] 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs] 0xffffffff8cc1e763 0xffffffff8cc1f240 0xffffffff8cc2006f 0xffffffff8d12579b ... Thu Aug 1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:39 2019 gss_key_timeout()=-13 proc: echo[32018] uid: 14100 now: 7510098709 gc_expire: 7510107517 7509867517 delta: 8 -231 Returning from: 0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss] Returning to : 0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc] 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc] 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs] 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs] 0xffffffff8cc1e763 0xffffffff8cc1f240 0xffffffff8cc2006f 0xffffffff8d125a1b Thu Aug 1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:44 2019 gss_key_timeout()=-13 proc: echo[32027] uid: 14100 now: 7510103729 gc_expire: 7510107517 7509867517 delta: 3 -236 Returning from: 0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss] Returning to : 0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc] 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc] 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs] 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs] 0xffffffff8cc1e763 0xffffffff8cc1f240 0xffffffff8cc2006f 0xffffffff8d125a1b Thu Aug 1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:49 2019 gss_key_timeout()=0 proc: echo[32037] uid: 14100 now: 7510108758 gc_expire: 7510408751 7510168751 delta: 299 59 Returning from: 0xffffffffc03ad000 : gss_key_timeout+0x0/0x50 [auth_rpcgss] Returning to : 0xffffffffc0348e1e : generic_key_timeout+0x5e/0xd0 [sunrpc] 0xffffffffc0347051 : rpcauth_key_timeout_notify+0x21/0x30 [sunrpc] 0xffffffffc04fe9bf : nfs_key_timeout_notify+0x2f/0x40 [nfs] 0xffffffffc04ed939 : nfs_file_write+0x59/0x1e0 [nfs] 0xffffffff8cc1e763 0xffffffff8cc1f240 0xffffffff8cc2006f 0xffffffff8d125a1b Thu Aug 1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:48:00 2019 nfs_ctx_key_to_expire()=0 proc: echo[32069] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:48:00 2019 nfs_ctx_key_to_expire()=0 proc: echo[32069] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
Just to double check I also measured time it takes to complete each write:
$ while [ 1 ]; do strace -tT -v -etrace=write /bin/echo aa >nethome/test ; rm nethome/test; sleep 5; done 15:47:19 write(1, "aa\n", 3) = 3 <0.001536> 15:47:19 +++ exited with 0 +++ 15:47:24 write(1, "aa\n", 3) = 3 <0.001501> 15:47:24 +++ exited with 0 +++ 15:47:29 write(1, "aa\n", 3) = 3 <0.001427> 15:47:29 +++ exited with 0 +++ 15:47:34 write(1, "aa\n", 3) = 3 <0.001455> 15:47:34 +++ exited with 0 +++ 15:47:39 write(1, "aa\n", 3) = 3 <0.001503> 15:47:39 +++ exited with 0 +++ 15:47:44 write(1, "aa\n", 3) = 3 <0.001542> 15:47:44 +++ exited with 0 +++ 15:47:49 write(1, "aa\n", 3) = 3 <0.000132> 15:47:49 +++ exited with 0 +++ 15:47:54 write(1, "aa\n", 3) = 3 <0.000069> 15:47:54 +++ exited with 0 +++ 15:48:00 write(1, "aa\n", 3) = 3 <0.000338> 15:48:00 +++ exited with 0 +++ ^C
# grep 'Thu Aug 1 14:47' /var/tmp/aa Thu Aug 1 14:47:01 2019 gss_key_timeout()=-13 proc: echo[31942] uid: 14100 now: 7510060000 gc_expire: 7510107517 7509867517 delta: 47 -192 Thu Aug 1 14:47:01 2019 nfs_ctx_key_to_expire()=1 proc: echo[31942] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:01 2019 nfs_ctx_key_to_expire()=1 proc: echo[31942] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:06 2019 gss_key_timeout()=-13 proc: echo[31951] uid: 14100 now: 7510065018 gc_expire: 7510107517 7509867517 delta: 42 -197 Thu Aug 1 14:47:06 2019 nfs_ctx_key_to_expire()=1 proc: echo[31951] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:06 2019 nfs_ctx_key_to_expire()=1 proc: echo[31951] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:11 2019 gss_key_timeout()=-13 proc: echo[31960] uid: 14100 now: 7510070037 gc_expire: 7510107517 7509867517 delta: 37 -202 Thu Aug 1 14:47:11 2019 nfs_ctx_key_to_expire()=1 proc: echo[31960] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:11 2019 nfs_ctx_key_to_expire()=1 proc: echo[31960] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:16 2019 gss_key_timeout()=-13 proc: echo[31972] uid: 14100 now: 7510075056 gc_expire: 7510107517 7509867517 delta: 32 -207 Thu Aug 1 14:47:16 2019 nfs_ctx_key_to_expire()=1 proc: echo[31972] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:16 2019 nfs_ctx_key_to_expire()=1 proc: echo[31972] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:19 2019 gss_key_timeout()=-13 proc: echo[31981] uid: 14100 now: 7510078637 gc_expire: 7510107517 7509867517 delta: 28 -211 Thu Aug 1 14:47:19 2019 nfs_ctx_key_to_expire()=1 proc: echo[31981] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:19 2019 nfs_ctx_key_to_expire()=1 proc: echo[31981] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:24 2019 gss_key_timeout()=-13 proc: echo[31991] uid: 14100 now: 7510083655 gc_expire: 7510107517 7509867517 delta: 23 -216 Thu Aug 1 14:47:24 2019 nfs_ctx_key_to_expire()=1 proc: echo[31991] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:24 2019 nfs_ctx_key_to_expire()=1 proc: echo[31991] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:29 2019 gss_key_timeout()=-13 proc: echo[32000] uid: 14100 now: 7510088673 gc_expire: 7510107517 7509867517 delta: 18 -221 Thu Aug 1 14:47:29 2019 nfs_ctx_key_to_expire()=1 proc: echo[32000] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:29 2019 nfs_ctx_key_to_expire()=1 proc: echo[32000] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:34 2019 gss_key_timeout()=-13 proc: echo[32009] uid: 14100 now: 7510093690 gc_expire: 7510107517 7509867517 delta: 13 -226 Thu Aug 1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:34 2019 nfs_ctx_key_to_expire()=1 proc: echo[32009] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:39 2019 gss_key_timeout()=-13 proc: echo[32018] uid: 14100 now: 7510098709 gc_expire: 7510107517 7509867517 delta: 8 -231 Thu Aug 1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:39 2019 nfs_ctx_key_to_expire()=1 proc: echo[32018] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:44 2019 gss_key_timeout()=-13 proc: echo[32027] uid: 14100 now: 7510103729 gc_expire: 7510107517 7509867517 delta: 3 -236 Thu Aug 1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:44 2019 nfs_ctx_key_to_expire()=1 proc: echo[32027] uid: 14100 ac_flags: 2 [ RPC_CRED_KEY_EXPIRE_SOON ] Thu Aug 1 14:47:49 2019 gss_key_timeout()=0 proc: echo[32037] uid: 14100 now: 7510108758 gc_expire: 7510408751 7510168751 delta: 299 59 Thu Aug 1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:47:49 2019 nfs_ctx_key_to_expire()=0 proc: echo[32037] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ] Thu Aug 1 14:47:54 2019 nfs_ctx_key_to_expire()=0 proc: echo[32046] uid: 14100 ac_flags: 4 [ RPC_CRED_NOTIFY_TIMEOUT ]
I tried a quick “fix” which works ok on 5.5.0-rc2:
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c index d75fddc..6d36ae6 100644 --- a/net/sunrpc/auth_gss/auth_gss.c +++ b/net/sunrpc/auth_gss/auth_gss.c @@ -1478,6 +1478,8 @@ static void gss_pipe_free(struct gss_pipe *p) ctx = rcu_dereference(gss_cred->gc_ctx); if (!ctx || time_after(timeout, ctx->gc_expiry)) ret = -EACCES; + if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry)) + clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags); rcu_read_unlock(); return ret;
I also wrote a systemtap script to modify a kernel on-the fly which essentially does the same as the patch in previous update (again, this is not a correct solution and is potentially risky).
# cat nfs_ctx_key_to_expire_workaround3.stp #!/usr/bin/env stap global RPC_CRED_KEY_EXPIRE_SOON = 2; // bit 1 global RPC_CRED_NOTIFY_TIMEOUT = 4; // bit 2 global IN; global AC_FLAGS; global LOG = 0; @define AC_FLAGS_TO_STR(ac_flags) %( " " . (@ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "") . (@ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "") . " " %) probe begin { if ($# > 0 && @1 == "log") LOG = 1; } probe module("nfs").function("nfs_ctx_key_to_expire") { IN[tid()] = 1; } probe module("nfs").function("nfs_ctx_key_to_expire").return { if ($return == 0) { delete IN[tid()]; delete AC_FLAGS[tid()]; next; } if (LOG) { // probefunc() returns outside function in .return and pp() is too verbose, so let's hard-code string printf("%s %s()=%d proc: %s[%d] uid: %d ", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire", $return, execname(), pid(), uid()); printf("ac_flags: %d [%s]\n", AC_FLAGS[tid()], @AC_FLAGS_TO_STR(AC_FLAGS[tid()])); } //print_backtrace(); //printf("\n"); %( guru_mode == 1 %? if (LOG) printf("%s %s PATCHING return code to 0\n", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire"); $return = 0; %: printf("%s %s NON-GURU MODE, would have patched rc=0\n", ctime(gettimeofday_s()), "nfs_ctx_key_to_expire"); %) delete AC_FLAGS[tid()]; [0/1861] delete IN[tid()]; } probe module("sunrpc").function("generic_key_to_expire").return { if (IN[tid()]) { acred = & @container_of(@entry($cred), "struct generic_cred", gc_base)->acred; AC_FLAGS[tid()] = acred->ac_flags; /* flags_s = " "; flags_s = flags_s . (acred->ac_flags & RPC_CRED_KEY_EXPIRE_SOON ? "RPC_CRED_KEY_EXPIRE_SOON" : "") flags_s = flags_s . (acred->ac_flags & RPC_CRED_NOTIFY_TIMEOUT ? "RPC_CRED_NOTIFY_TIMEOUT" : "") printf("ac_flags: %d [%s ]\n", acred->ac_flags, flags_s); */ } } probe module("auth_rpcgss").function("gss_key_timeout") { gss_cred = & @container_of($rc, "struct gss_cred", gc_base); if (gss_cred == 0 || gss_cred->gc_ctx == 0) { if (LOG) printf("%s %s NOT PATCHING %s[%d], cr_uid: %d, cr_flags: %d, gss_cred: %x, gss_cred->gc_ctx: %x\n", ctime(gettimeofday_s()), "gss_key_timeout", execname(), pid(), $rc->cr_uid, $rc->cr_flags, gss_cred, gss_cred ? gss_cred->gc_ctx : 0); next; } dt = (gss_cred->gc_ctx->gc_expiry - jiffies())/HZ(); %( guru_mode == 1 %? // define RPCAUTH_CRED_UPTODATE 1 // this is a bit number, so it is 2 in decimal if (dt < 300 && $rc->cr_flags & 2) { // new_flags = $rc->cr_flags - 2; new_flags = $rc->cr_flags & ~(1 << 1); if (LOG) printf("%s %s patching %s[%d], cr_uid: %d, dt: %d cr_flags: %d new cr_flags: %d\n", ctime(gettimeofday_s()), "gss_key_timeout", execname(), pid(), $rc->cr_uid, dt, $rc->cr_flags, new_flags); //$rc->cr_flags = new_flags; $rc->cr_flags = $rc->cr_flags & ~(1 << 1); //gss_cred->gc_ctx->gc_expiry = 360 * HZ(); } %: // just print some info in non-guru mode if (!(dt < 300 && $rc->cr_flags & 2)) next; printf("%s %s[%d], cr_uid: %d dt: %d cr_flags: %d\n", ctime(gettimeofday_s()), execname(), pid(), $rc->cr_uid, dt, $rc->cr_flags); %) }
To run it:
# ./nfs_ctx_key_to_expire_workaround3.stp -g --suppress-time-limits nolog &
Notice, that running the workaround means any dirty pages are not destaged in the last 240s before key expiry and all writes are still buffered - if for any reason the nfs client won't manage to refresh the key then all the dirty pages might never destage.
The current behaviour is rather an lazy approach to fix the issue - I believe a much better solution would be to try to refresh they key a minut earlier - if it is unsuccesful then keep using the previous key and fallback to current behaviour, otherwise reset the timer and keep using new key. This would mean that in a vast majority of cases the 4 minutes of slowness would never happen.
Unfortunately I haven’t got time to work on a proper fix (and for different reasons the issue became less noticeable for us, so it ended up being a low priority).
btw: the kprobe script was obtained from https://github.com/brendangregg/perf-tools
nice work
ReplyDeleteGreat writeup Rob
ReplyDelete