Friday, April 30, 2021

Linux Tickless Kernel & CPU Usage Statistics

  I run into an interesting problem yesterday. After changing some kernel boot parameters to enable tickless kernel, an application which is supposed to essentially "spin" 100% on a CPU was being reported as being mostly idle. This was rather puzzling.

After some debugging and reading docs I've managed to reproduce the issue with a simple test case. The test is reproducible on CentOS 8.3 x86_64.

$ uname -r
4.18.0-240.1.1.el8_3.x86_64

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/vmlinuz-4.18.0-240.1.1.el8_3.x86_64 root=/dev/mapper/sysvg-root ro crashkernel=auto
           resume=/dev/mapper/sysvg-swap rd.lvm.lv=sysvg/root rd.lvm.lv=sysvg/swap biosdevname=1 ipv6.disable=1
           net.ifnames=1 rhgb quiet intel_pstate=passive
           intel_idle.max_cstate=0 idle=poll
           isolcpus=domain,managed_irq,15-17 irqaffinity=0-13 nohz_full=15

Let's run a single threaded tight loop bound to CPU 16 which is 100% user space (notice the CPU 16 is isolated otherwise):
# time taskset -c 16 sh -c "i=1; while [ \$i -lt 10000000 ]; do true; ((i++)); done"

real    0m30.713s
user    0m30.688s
sys     0m0.001s
In another terminal I run mpstat showing CPU 16 usage statistics:
$ mpstat -P 16 1
...
00:55:14     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:15      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

00:55:15     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:16      16   68.32    0.00    0.00    0.00    0.99    0.00    0.00    0.00    0.00   30.69

00:55:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:17      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

00:55:17     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:18      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

00:55:18     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:19      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

00:55:19     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:20      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

00:55:20     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:21      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
...
00:55:45     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:46      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

00:55:46     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:47      16    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00

00:55:47     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
00:55:48      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
^C
As expected CPU 16 was 100% in USR for about 30s.
So far so good.
 
I then modified kernel boot arguments:
  • added nohz option to isolcpus
  • updated nohz_full variable to include CPUs 15-17 (which includes the CPU 16 we are testing here)
After reboot I repeated the above test.
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/vmlinuz-4.18.0-240.1.1.el8_3.x86_64 root=/dev/mapper/sysvg-root ro crashkernel=auto
           resume=/dev/mapper/sysvg-swap rd.lvm.lv=sysvg/root rd.lvm.lv=sysvg/swap biosdevname=1 ipv6.disable=1
           net.ifnames=1 rhgb quiet intel_pstate=passive
           intel_idle.max_cstate=0 idle=poll
           isolcpus=nohz,domain,managed_irq,15-17 irqaffinity=0-13 nohz_full=15-17
# date;time taskset -c 16 sh -c "i=1; while [ \$i -lt 10000000 ]; do true; ((i++)); done"
Fri 30 Apr 01:11:15 BST 2021

real    0m30.849s
user    0m30.849s
sys     0m0.000s
It took about the same time for it to finish (as expected). However mpstat output is now different:
$ mpstat -P 16 1
...
01:11:13     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:14      16    0.00    0.00    0.00    0.00    0.99    0.00    0.00    0.00    0.00   99.01

01:11:14     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:15      16   23.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   76.06

01:11:15     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:16      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:17      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:17     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:18      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:18     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:19      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:19     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:20      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:20     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:21      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:21     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:22      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:22     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:23      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:23     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:24      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:24     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:25      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:25     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:26      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:26     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:27      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:27     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:28      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:29      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:29     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:30      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:30     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:31      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:31     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:32      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:32     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:33      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:33     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:34      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:34     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:35      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:35     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:36      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:36     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:37      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:37     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:38      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:39      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:39     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:40      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:40     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:41      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:41     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:42      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:42     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:43      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:43     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:44      16  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:11:44     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:45      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:45     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:46      16   76.17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   23.83

01:11:46     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:47      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:47     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:48      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:48     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:49      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

01:11:49     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:11:50      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
^C

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:      16   83.36    0.00    0.00    0.00    0.03    0.00    0.00    0.00    0.00   16.62

Notice that most of the time while the above loop was spinning on the CPU, mpstat reported the CPU as idle.

Given the loop finished in about the same time as before it was not taken off CPU and had to constantly spin as before.  Also the real time and usr time are similar as reported by time. This means that mpstat reporting the CPU as being IDLE most of the time can't be true, in fact during the 30s run it had to be spinning 100% in USR.

There is an excelent overview on cpu isolation and tickless kernel by SUSE Labs which sheds some light on what and why is happening here.
Specifically check the Part 2 -> 3.1 Cputime accounting section.

Essentially what seems to be happennig here is that CPU times are only updated when CPU enters or exits user space. This results in a more accurate usage numbers, and more importantly does not require to interrupt an running application, which is the whole point of tickless kernel.

But what if an application does not issue any system calls and ends up spinning 100% in user space?
The CPU stats would never get updated... but they actually do every few seconds.
I'm not entirely sure why - probably there is some mechanism to gather them at some interval or there is an IRQ happening (shouldn't be) or... not that important at the moment.

To further confirm that the cpu usage stats are not being updated due to the application not switching to kernel mode and then back to user mode, let's modify the loop so it now calls some syscalls as well (write()):
# date;time taskset -c 16 sh -c "i=1; while [ \$i -lt 10000000 ]; do echo \$i >/dev/null; ((i++)); done"
...
$ mpstat -P 16 1
...
01:47:34     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:35      16   81.00    0.00   19.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:47:35     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:36      16   80.20    0.00   19.80    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:47:36     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:37      16   80.81    0.00   19.19    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:47:37     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:38      16   80.20    0.00   19.80    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:47:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:39      16   80.81    0.00   19.19    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:47:39     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:40      16   80.20    0.00   19.80    0.00    0.00    0.00    0.00    0.00    0.00    0.00

01:47:40     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:47:41      16   80.81    0.00   19.19    0.00    0.00    0.00    0.00    0.00    0.00    0.00
...
As suspected the CPU stats are now constantly updated and mpstat is reporting expected values.
 
Notice that the cpu usage reporting issue is not specific to mpstat and other tools behave in a similar manner as they all get stats from /proc/stat.
 
Another interesting observation is, that if you start tracing such an application by using tools like strace or bpftrace (e.g. bpftrace -e 'profile:hz:99/cpu==16/{@=count();}', etc.) this results in the cpu stats to be constantly updated and the issue "miraculously" goes away...
 
Mystery solved.


btw: in case you wonder how likely it is that an actual real application wouldn't issue any (or very rarely) syscalls - well, with network kernel bypass this definitetly can and is the case with some apps


Update1: if a spinning process is running under SCHED_RR then the cpu accounting issue goes away, but it does happen under SCHED_OTHER and SCHED_FIFO - although less so with the latter.