Robert Milkowski's blog: Give me my 2nd CPU back!

I logged into a server which has been put into production recently and has been running for over 100 days now. It's HP DL380 G4 server running Solaris 10 11/06. By looking into graphs I could see 50% of CPU utilization in a system for last 2 weeks. By using mpstat I could see 100% CPU utilization on CPU 1 in SYS (kernel). The outcome was that cpu 1 was not available for applications. As there's not much load yet prstat didn't show anything consuming cpu. I quickly checked different server, exactly the same spec, and everything was fine there. Nothing in logs either.

So lets dtrace! :)

# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    3   0    0   419  218  191    0    0   24    0    25    0   0   0 100
1    0   0    0     2    0    0    0    0   49    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0   10   0    0   649  309  347    0    0   25    0   174    0   2   0  98
1    0   0    0     2    0    0    0    0   50    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    2   0    0   651  313  366   11    0   22    0   193    8   2   0  90
1    0   0    0     2    0    0    0    0   50    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   870  453  563    0    0   26    0   174    0   2   0  98
1    0   0    0     2    0    0    0    0   49    0     0    0 100   0   0
^C
#

All we can see on cpu 1 is that there are some interrupts but no csw/icsw/syscl/migr at all.
It looks like it's stuck entirely in kernel.
To confirm that nothing from user space is actually calling some system calls:

# dtrace -n sched:::on-cpu'/cpu == 1/{self->t=timestamp;}' \
  -n sched:::off-cpu'/self->t/{@[execname,pid]=sum(timestamp-self->t);self->t=0;}' \
  -n tick-2s'{printa(@);}'
dtrace: description 'sched:::on-cpu' matched 3 probes
dtrace: description 'sched:::off-cpu' matched 3 probes
dtrace: description 'tick-2s' matched 1 probe
CPU     ID                    FUNCTION:NAME
0  42146                         :tick-2s

0  42146                         :tick-2s

^C

#

So nothing was put by scheduler on cpu 1.

Lets check what's running in kernel on cpu 0 first (for comparison with cpu 1 later):

# dtrace -n profile-101'/cpu == 0/{@[stack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
dtrace: description 'profile-101' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
0  42148                         :tick-5s

            unix`ddi_get16+0x10
            genunix`callout_execute+0xd6
            genunix`taskq_thread+0xbc
            unix`thread_start+0x8
              1

            unix`ddi_io_put16+0x10
            genunix`callout_execute+0xd6
            genunix`taskq_thread+0xbc
            unix`thread_start+0x8
              1

            unix`_sys_sysenter_post_swapgs+0xab
              1

              2

            unix`cpu_halt+0xb2
            unix`idle+0x89
            unix`thread_start+0x8
            496


#

So in 99% it's being IDLE.
Now lets do the same for CPU 1.

# dtrace -n profile-101'/cpu == 1/{@[stack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
dtrace: description 'profile-101' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
0  42148                         :tick-5s

[...]

            unix`bcopy+0x14
            usba`usba_vlog+0xc7
            usba`usb_dprintf2+0xa4
            uhci`uhci_intr+0x14c
            unix`av_dispatch_autovect+0x78
            unix`intr_thread+0x50
              7

            unix`ddi_get16+0x10
            uhci`uhci_isoc_update_sw_frame_number+0x9
            uhci`uhci_intr+0xce
            unix`av_dispatch_autovect+0x78
            unix`intr_thread+0x50
             78

            unix`ddi_io_put16+0x10
            unix`av_dispatch_autovect+0x78
            unix`intr_thread+0x50
            127

            unix`ddi_get16+0x10
            unix`av_dispatch_autovect+0x78
            unix`intr_thread+0x50
            190


#


Above output confirms we're stuck in some interrupt, probably related to usb.

[checking cvs.opensolaris.org for usba_vlog function]

# dtrace -n fbt:usba:usba_vlog:entry'{@[stringof arg3]=count();}' -n tick-5s'{exit(0);}'
dtrace: description 'fbt:usba:usba_vlog:entry' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
0  42148                         :tick-5s

uhci_intr: Controller halted                                 720725
#

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/usb/hcd/uhci/uhci.c#893
  887  /*
  888   * This should not occur. It occurs only if a HC controller
  889   * experiences internal problem.
  890   */
  891  if (intr_status & USBSTS_REG_HC_HALTED) {
  892   USB_DPRINTF_L2(PRINT_MASK_INTR, uhcip->uhci_log_hdl,
  893       "uhci_intr: Controller halted");
  894   cmd_reg = Get_OpReg16(USBCMD);
  895   Set_OpReg16(USBCMD, (cmd_reg | USBCMD_REG_HC_RUN));
  896  }

I'm not sure how bad it is but it doesn't look good.

Lets look around for USB.

# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
usb0/1                         unknown      empty        unconfigured ok
usb0/2                         unknown      empty        unconfigured ok
usb1/1                         unknown      empty        unconfigured ok
usb1/2                         unknown      empty        unconfigured ok
usb2/1                         unknown      empty        unconfigured ok
usb2/2                         unknown      empty        unconfigured ok
usb3/1                         unknown      empty        unconfigured ok
usb3/2                         unknown      empty        unconfigured ok
usb4/1                         unknown      empty        unconfigured ok
usb4/2                         unknown      empty        unconfigured ok
usb4/3                         unknown      empty        unconfigured ok
usb4/4                         unknown      empty        unconfigured ok
usb4/5                         unknown      empty        unconfigured ok
usb4/6                         unknown      empty        unconfigured ok
usb4/7                         unknown      empty        unconfigured ok
usb4/8                         unknown      empty        unconfigured ok

# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    3   0    0   419  218  191    0    0   24    0    25    0   0   0 100
1    0   0    0     2    0    0    0    0   49    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0   10   0    0   666  311  363    3    0   17    0   176    0   2   0  98
1    0   0    0     2    0    0    0    0   50    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   678  315  360    0    0   23    0   159    0   0   0 100
1    0   0    0     2    0    0    0    0   50    0     0    0 100   0   0
^C

Didn't help.

# modinfo |grep -i usb
83 fffffffff01c1000  14000  56   1  ehci (USB EHCI Driver 1.14)
84 fffffffff01d5000  28dd8   -   1  usba (USBA: USB Architecture 2.0 1.59)
85 fffffffff01fd000   c6d8  58   1  uhci (USB UHCI Controller Driver 1.47)


# devfsadm -vC
# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    3   0    0   419  218  191    0    0   24    0    25    0   0   0 100
1    0   0    0     2    0    0    0    0   49    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0   10   0    0   664  315  363    0    0   23    0   181    0   0   0 100
1    0   0    0     2    0    0    0    0   50    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   947  501  694    1    0   26    0   279    0   2   0  98
1    0   0    0     2    0    0    0    0   51    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   650  307  356    0    0   25    0   175    1   1   0  98
1    0   0    0     2    0    0    0    0   50    0     0    0 100   0   0
^C

Still didn't help.

# modunload -i 84
can't unload the module: Device busy
# modunload -i 85
can't unload the module: Device busy
# modunload -i 83
can't unload the module: Device busy
# modunload -i 84
can't unload the module: Device busy
# modunload -i 85
can't unload the module: Device busy
# cfgadm -al
cfgadm: Configuration administration not supported

# modinfo |grep -i usb
83 fffffffff01c1000  14000  56   1  ehci (USB EHCI Driver 1.14)
84 fffffffff01d5000  28dd8   -   1  usba (USBA: USB Architecture 2.0 1.59)
85 fffffffff01fd000   c6d8  58   1  uhci (USB UHCI Controller Driver 1.47)
#

# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    3   0    0   419  218  191    0    0   24    0    25    0   0   0 100
1    0   0    0     2    0    0    0    0   49    0     0    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   737  453  337    1   40    5    0   187    0   1   0  99
  1   10   0    0   164    0  323    1   36    6    0   110    0   1   0  99
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   526  310  221    0   26    0    0   102    0   0   0 100
  1    0   0    0    92    0  149    0   27    0    0    64    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    1   520  308  203    0   21    0    0    93    1   0   0  99
  1    0   0    0   100    0  168    1   29    3    0    90    0   1   0  99
^C

# dtrace -n fbt:usba:usba_vlog:entry'{@[stringof arg3]=count();}' -n tick-5s'{exit(0);}'
dtrace: description 'fbt:usba:usba_vlog:entry' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
1  42148                         :tick-5s

#

That's much better - both CPUs are IDLE now.
Looks like some problem with USB device/driver.

ps. I wonder what intrstat(1M) would show - "unfortunately" the problem gone...
Robert Milkowski's blog

Monday, December 31, 2007

Give me my 2nd CPU back!

No comments: