Friday, November 14, 2008

Oracle's Listener Performance

Recently I posted a blog entry on TCP ListenDrops and Listener. Since the last tuning we haven't experience even single listendrop - good. Nevertheless we started experiencing some performance issues with the same Oracle instance. The problem was that although there was plenty of CPU to spare, no increase in disk I/O and generally nothing wrong with the system at first glance, getting connected to Oracle did take even 4 seconds. Once connected the database was responding really fast. Because the connection time used to be well below 1s before the impact was relatively huge - especially when you have some programs doing couple of connections and sql queries to the database while a customer is waiting for a page to be build - almost 4s extra for each database connection and you end-up with extra 4-12s of delay for an user. So what happened?

First I wanted to check if it is the database problem afterall. There was a CGI script I knew was producing the results much slower than it used to do. I was told it usually did produce the report within 5s and now it takes 8-15s to do so. Quick look around of a midlleware box and at it's historical data (CPU, run queue, network, etc.) didn't reveal anything obvious. So I quickly changed slightly the shortlived.d script from DTraceToolkit to print me a total CPU time and real time it took to execute the CGI script on a production. The real times were oscilating between 8-16s while vtime was always within 2s. What it means is that the script is definitely not CPU bound. Quick look at the script itself and it doesn't actually do much - basically two database connections and it prints parsed output. It strongly suggests that it spends most of its time waiting for the database and that's where I should focus on the problem.

I tried to connect to the database using sqlplus client and it took couple of seconds even before I got a prompt. First thought was that it is the ListenDrop issue I was blogging about recently - it wasn't. So I checked if the problem is still there if I connect to the listener locally from the db host itself instead of going over a network, to eliminate everything in between. The issue was still there so it definitely wasn't a network issue (and least not per se) and rather something local to the db box.

I decided to use truss utility on listener to see why it takes some much time to establish a session. It is easy to filter out my connection as there is nothing else connecting locally to the database so I need to look for an accept() related to an IP address of the db server. So I run sqlplus client and truss more or less at the same time and repeat it several times. It quickly turned out that it takes over 3s on average before listener even does accept() for my connection


# truss -o /tmp/milek1 -d -f -v all -ED -p 15697
15697: 3.9228 0.0001 0.0001 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 35705


It means that the problem is somewhat related to the ListenDrop issue. It means that new connection is being backloged for almost 4s before listener picks it up - it just doesn't do accept()'s quick enough to drain the queue in real time. So either we are generating more connections per second to the database or for some reason listener is slower. Because it was already late and we couldn't find any recent changes which would cause more connections to the database I decided to look at what listener actually does and how much time it takes.

After a quick analyze of truss output for several connections to the listener and knowing that listener is a single threaded process the overall picture of what it does is:

[...]
pollsys()
accept()
read() - over a network descriptor!
pipe()
pipe()
fork1()
fork1()
execve() - exec() given oracle process
_exit()
waitid() - wait for child to exit
read() - from pipe
write() - to pipe
read() - from pipe
read() - from pipe
close() - close pipes and network fd
pollsys()
[...]


And it does it in a loop. Lets see a truss output from one such a connection handling:

# truss -o /tmp/milek6 -d -f -v all -ED -p 15697
# grep ^15697 /tmp/milek6
[...]
15697: 1.5426 1.0901 0.0000 pollsys(0x10041CDA8, 2, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.5440 0.0014 0.0000 getsockname(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.5442 0.0002 0.0000 getpeername(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) Err#134 ENOTCONN
15697: 1.5463 0.0021 0.0020 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 41612
15697: 1.5464 0.0001 0.0000 getsockname(13, 0xFFFFFFFF7FFF888C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.5465 0.0001 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF88AC) = 0
15697: 1.5466 0.0001 0.0000 setsockopt(13, tcp, TCP_NODELAY, 0xFFFFFFFF7FFF8A14, 4, SOV_DEFAULT) = 0
15697: 1.5468 0.0002 0.0000 fcntl(13, F_SETFD, 0x00000001) = 0
15697: 1.5472 0.0004 0.0001 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnk59.so", F_OK) = 0
15697: 1.5473 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libngss9.so", F_OK) Err#2 ENOENT
15697: 1.5474 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnnts9.so", F_OK) Err#2 ENOENT
15697: 1.5475 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnrad9.so", F_OK) = 0
15697: 1.5476 0.0001 0.0000 sigaction(SIGPIPE, 0xFFFFFFFF7FFF8B80, 0xFFFFFFFF7FFF8CA8) = 0
15697: new: hand = 0x00000001 mask = 0x9FBFF057 0x0000FFF7 0 0 flags = 0x000C
15697: old: hand = 0x00000001 mask = 0 0 0 0 flags = 0x0000
15697: 1.5478 0.0002 0.0000 pollsys(0x10041CDA8, 3, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=0
15697: fd=13 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.5489 0.0011 0.0000 read(13, "\0F0\0\001\0\0\001 801 ,".., 8208) = 240
15697: 1.5494 0.0005 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF9E7C) = 0
15697: 1.5497 0.0003 0.0000 fcntl(13, F_SETFD, 0x00000000) = 0
15697: 1.5497 0.0000 0.0000 pipe() = 14 [15]
15697: 1.5498 0.0001 0.0000 pipe() = 16 [17]
15697: 1.5798 0.0300 0.0299 fork1() = 8195
15697: 1.5853 0.0055 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
15697: 1.6395 0.0542 0.0000 waitid(P_PID, 8195, 0xFFFFFFFF7FFF6D10, WEXITED|WTRAPPED) = 0
15697: siginfo: SIGCLD CLD_EXITED pid=8195 status=0x0000
15697: 1.6396 0.0001 0.0000 close(14) = 0
15697: 1.6397 0.0001 0.0000 close(17) = 0
15697: 1.7063 0.0666 0.0000 read(16, " N T P 0 8 1 9 7\n", 64) = 10
15697: 1.7072 0.0009 0.0000 getpid() = 15697 [1]
15697: 1.7073 0.0001 0.0000 fcntl(16, F_SETFD, 0x00000001) = 0
15697: 1.7076 0.0003 0.0000 write(15, "\0\0\0 =", 4) = 4
15697: 1.7078 0.0002 0.0000 write(15, " ( A D D R E S S = ( P R".., 61) = 61
15697: 1.7079 0.0001 0.0000 write(15, "\0\00401", 4) = 4
15697: 1.8377 0.1298 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8397 0.0020 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8397 0.0000 0.0000 close(15) = 0
15697: 1.8398 0.0001 0.0000 close(16) = 0
15697: 1.8400 0.0002 0.0000 close(13) = 0
15697: 1.8424 0.0024 0.0000 lseek(3, 0, SEEK_CUR) = 0x06919427
15697: 1.8425 0.0001 0.0001 write(3, " 0 7 - N O V - 2 0 0 8 ".., 220) = 220
15697: 2.6446 0.8021 0.0000 pollsys(0x10041CDA8, 2, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
[...]


Now let's look closely at timing. It took almost 0.3s (1.84-1.54 = 0.30) from accept() to finish handling that connection and move to another accept(). Assuming that would be an average time to handle one connection to the listener it would be able to process only about 3 connections per second! That's not much.

Check the times I marked in red in above truss output and add these times: 0.0300+0.0542+0.0666+0.1298 = .2806. Remember that it took about 0.3s for entire loop to complete. So from the moment listener does fork1() to the moment it 2nd time reads from a pipe from a child it takes more than 90% of entire real time spent in handling a request.

When listener does a fork1() it's child will basically do another fork1() and exit:

# grep ^8195 /tmp/milek6
8195: 1.5798 1.5798 0.0000 fork1() (returning as child ...) = 15697
8195: 1.5855 0.0057 0.0000 getpid() = 8195 [15697]
8195: 1.5863 0.0008 0.0000 lwp_self() = 1
8195: 1.5864 0.0001 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
8195: 1.5868 0.0004 0.0000 schedctl() = 0xFFFFFFFF7F78A000
8195: 1.6115 0.0247 0.0246 fork1() = 8197
8195: 1.6159 0.0044 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
8195: 1.6163 0.0004 0.0000 _exit(0)


Let's see what happens in a 2nd child then:


# grep ^8197 /tmp/milek6
8197: 1.6115 1.6115 0.0000 fork1() (returning as child ...) = 8195
[...]
8197: 1.6433 0.0139 0.0137 execve("/dslrp/ua01/app/oracle/product/9.2.0/bin/oracle", 0x1002A74B0, 0x1112E95E0) argc = 2
[...]

8197: 1.7062 0.0000 0.0000 write(17, " N T P 0 8 1 9 7\n", 10) = 10
[exec has already completed, a new process is up and it is signaling to listner that it is up]
[it writes to pipe, it is the first read from the pipe in the listener]


[...]
8197: 1.7153 0.0001 0.0000 cladm(CL_INITIALIZE, CL_GET_BOOTFLAG, 0xFFFFFFFF7D1044FC) = 0
8197: bootflags=CLUSTER_CONFIGURED|CLUSTER_BOOTED
8197: 1.7164 0.0011 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: 1.7164 0.0000 0.0000 cladm(CL_CONFIG, CL_HIGHEST_NODEID, 0xFFFFFFFF7D104508) = 0
8197: nodeid=64
8197: 1.7165 0.0001 0.0000 cladm(CL_CONFIG, CL_NODEID, 0xFFFFFFFF7D104504) = 0
8197: nodeid=2
8197: 1.7176 0.0011 0.0010 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7186 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.7196 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4EE8) = 0
8197: 1.7206 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E88) = 0
8197: 1.7219 0.0013 0.0012 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7229 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.7238 0.0009 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4EE8) = 0
8197: 1.7256 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7273 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7291 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7308 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7325 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7343 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7360 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7377 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7395 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7412 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7429 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7447 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7465 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7482 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7499 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7517 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7534 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7552 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7570 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7587 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7605 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7622 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7639 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7657 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7674 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7691 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7709 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7726 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7743 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7760 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7778 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7795 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7812 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7830 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7847 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7864 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7882 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7899 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7917 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7934 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7951 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7968 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7986 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8003 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8020 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8037 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8054 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8072 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8089 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8106 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8123 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8140 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8158 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8175 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8192 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8209 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8226 0.0017 0.0016 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8244 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8262 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8279 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8296 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8313 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8322 0.0009 0.0008 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: 1.8324 0.0002 0.0001 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF5168) = 0
[...]
8197: 1.8377 0.0001 0.0000 write(17, "\0\0\0\0", 4) = 4
[it is a second time it writes to pipe and that's the 2nd read from the pipe in listener]

[...]
8197: 1.8397 0.0002 0.0000 write(17, "\0\0\0\0", 4) = 4
[3rd time it writes to pipe and after the 3rd read from pipe listener cleans up and moves to another connection]

[...]


It calls cladm() 75 times and the total time it takes to just handle these 75 cladm()'s is about .11s - that's quite a lot of time compared to .28s spent in listener waiting for a child. If I could get rid of these cladm() I would almost double the connection rate listener can handle.
Lets see what exactly is calling cladm():

# dtrace -n syscall::cladm:entry'{@[ustack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
[...]
libc.so.1`_cladm+0x4
nss_cluster.so.1`0xffffffff7d003364
nss_cluster.so.1`0xffffffff7d00219c
nss_cluster.so.1`0xffffffff7d0026e8
nss_cluster.so.1`0xffffffff7d002d1c
nss_cluster.so.1`0xffffffff7d0019b8
libc.so.1`nss_search+0x288
libnsl.so.1`_switch_gethostbyname_r+0x70
libnsl.so.1`_get_hostserv_inetnetdir_byname+0x958
libnsl.so.1`gethostbyname_r+0xb8
oracle`nttmyip+0x2c
oracle`nttcon+0x698
oracle`ntconn+0xd4
oracle`nsopen+0x840
oracle`nsgetinfo+0xb4
oracle`nsinh_hoffable+0x38
oracle`nsinh_hoff+0xe20
oracle`nsinherit+0x204
oracle`niotns+0x44c
oracle`osncon+0x3a0
64



Quick look at SunSolve and I found Document ID: 216260

Lets confirm if we have a cluster keyword in /etc/nsswitch.conf file:


# grep ^hosts /etc/nsswitch.conf
hosts: cluster files [SUCCESS=return] dns


I removed cluster keyword from the file and checked with truss again if it made any difference.


# truss -o /tmp/milek4 -d -f -v all -ED -p 15697
# grep ^15697 /tmp/milek4
[...]
15697: 1.6867 0.0020 0.0020 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 41573
15697: 1.6869 0.0002 0.0000 getsockname(13, 0xFFFFFFFF7FFF888C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.6870 0.0001 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF88AC) = 0
15697: 1.6871 0.0001 0.0000 setsockopt(13, tcp, TCP_NODELAY, 0xFFFFFFFF7FFF8A14, 4, SOV_DEFAULT) = 0
15697: 1.6881 0.0010 0.0000 fcntl(13, F_SETFD, 0x00000001) = 0
15697: 1.6885 0.0004 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnk59.so", F_OK) = 0
15697: 1.6887 0.0002 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libngss9.so", F_OK) Err#2 ENOENT
15697: 1.6888 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnnts9.so", F_OK) Err#2 ENOENT
15697: 1.6890 0.0002 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnrad9.so", F_OK) = 0
15697: 1.6891 0.0001 0.0000 sigaction(SIGPIPE, 0xFFFFFFFF7FFF8B80, 0xFFFFFFFF7FFF8CA8) = 0
15697: new: hand = 0x00000001 mask = 0x9FBFF057 0x0000FFF7 0 0 flags = 0x000C
15697: old: hand = 0x00000001 mask = 0 0 0 0 flags = 0x0000
15697: 1.6893 0.0002 0.0000 pollsys(0x10041CDA8, 3, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=0
15697: fd=13 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.6906 0.0013 0.0000 read(13, "\0F0\0\001\0\0\001 801 ,".., 8208) = 240
15697: 1.6911 0.0005 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF9E7C) = 0
15697: 1.6914 0.0003 0.0000 fcntl(13, F_SETFD, 0x00000000) = 0
15697: 1.6915 0.0001 0.0000 pipe() = 14 [15]
15697: 1.6918 0.0003 0.0000 pipe() = 16 [17]
15697: 1.7276 0.0358 0.0357 fork1() = 1214
15697: 1.7327 0.0051 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
15697: 1.7874 0.0547 0.0000 waitid(P_PID, 1214, 0xFFFFFFFF7FFF6D10, WEXITED|WTRAPPED) = 0
15697: siginfo: SIGCLD CLD_EXITED pid=1214 status=0x0000
15697: 1.7877 0.0003 0.0000 close(14) = 0
15697: 1.7878 0.0001 0.0000 close(17) = 0
15697: 1.8755 0.0877 0.0000 read(16, " N T P 0 1 2 1 6\n", 64) = 10
15697: 1.8771 0.0016 0.0000 getpid() = 15697 [1]
15697: 1.8772 0.0001 0.0000 fcntl(16, F_SETFD, 0x00000001) = 0
15697: 1.8777 0.0005 0.0000 write(15, "\0\0\0 =", 4) = 4
15697: 1.8778 0.0001 0.0000 write(15, " ( A D D R E S S = ( P R".., 61) = 61
15697: 1.8779 0.0001 0.0000 write(15, "\0\00401", 4) = 4
15697: 1.8829 0.0050 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8848 0.0019 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8851 0.0003 0.0000 close(15) = 0
15697: 1.8852 0.0001 0.0000 close(16) = 0
15697: 1.8854 0.0002 0.0000 close(13) = 0
15697: 1.8878 0.0024 0.0000 lseek(3, 0, SEEK_CUR) = 0x0690820E
15697: 1.8879 0.0001 0.0000 write(3, " 0 7 - N O V - 2 0 0 8 ".., 220) = 220


Now the timings are better and if you compare numbers you will see more than 35% of time reduction it takes for listener to wait for a child before it can move to process another request. It improves the connaction rate listener can do from 3.3 per second to about 5 per second.

Why cladm() is being called 75 times each times Oracle process starts - I don't know, at first look looks like some bug in nss_cluster.so.1 library. It doesn't matter for now as it is a good workaround. Nevertheless it will have to be raised with Sun's support.

The other issue is that system hasn't been patched recently and nssswitch.conf file hasn't been changed. So while it is good to see some performance improvement there must be something else which caused listener to process new connections more slowly. I manually confirmed if the change made any difference - now sqlplus client connect in less than 2s and the CGI script executes in shorter time. Puting the cluster keyword back and numbers got worse again. Removing it and numbers improve. Good.

If you look closely again at timings you will notice that after cladm() was removed from the picture the majority of time listener spends now is waiting for two fork1()'s and execve() to complete. Now why would they be slower than before? I was almost certain it's not due to CPU but it could be due to memory and it rather has to be sometething related only to listener as once oracle process is up it does process requests as fast as usual.

I checked how big is a listener process and it was over 400MB in size (RSS). We quickly set-up another listener on the box, listening on a different tcp port, and it turned out it was only about 23MB in size. We tried to connect to the database using the new listener and it took much less than a second. We switch over some traffic to new listener and it was able to process more requests per second. We could also verify it by writing simple dtrace script which measures number of accept()'s per second break down by PID.


cat accept_rate.d
#!/usr/sbin/dtrace -qs

syscall::accept:entry
{
setopt("aggsortkey");
setopt("aggsortkeypos", "1");
@[execname,pid]=count();
}

tick-1s
{
printf("%Y\n", walltimestamp);
printa("process: %-20s [%-5d] %@5d\n",@);
clear(@);
printf("\n");
}


We left 2nd listener and reconfigured some clients tnsora files so they were load-balance between two listeners when establishing new connections to the database. That way we could compare both listeners and further convince our selfes that restart should help the performance.

Knowing that basically listener loops between pollsys() syscalls while handling new connections I wrote a simple script showing time distribution it takes listener to handle a connection.


# cat listener_loop_times.d
#!/usr/sbin/dtrace -qs

syscall::pollsys:return
/execname == "tnslsnr"/
{
self->t=timestamp;
self->vt=vtimestamp;
}

syscall::pollsys:entry
/self->t/
{
@time[execname,pid]=lquantize((timestamp-self->t)/10000000,0,30);
@vtime[execname,pid]=lquantize((vtimestamp-self->vt)/10000000,0,10);

self->t=0;
self->vt=0;
}

tick-5s
{
printf("%Y\n", walltimestamp);
printa(@time);
printa(@vtime);
}


Below yuo can see a sample output, where 25351 is a PID of the new listener and 15697 is a PID of the old listener which grew to over 400MB in size.


18 92927 :tick-5s
tnslsnr 25351
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 162
1 | 0
2 |@@@@@@@@@ 71
3 |@@@@@@@@@@@ 87
4 | 0

tnslsnr 15697
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 206
1 | 0
2 | 0
3 | 0
4 | 0
5 | 0
6 | 0
7 | 0
8 | 1
9 |@ 10
10 |@@@@@@ 67
11 |@@@@@@@@@ 90
12 |@@@@ 41
13 | 3
14 | 1
15 | 0


tnslsnr 25351
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 320
1 | 0

tnslsnr 15697
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 207
1 | 0
2 |@@@@@@@ 76
3 |@@@@@@@@@@@ 118
4 |@@ 17
5 | 0
6 | 1
7 | 0



We did restart the original listener and its performance was much better as expected.

It turned out there is an Oracle's bug (5576565) which is a memory leak in listener causing a performance problem. Oracle provides Patch 5576565 which should fix the issue.

Some quick conclusions:
  • Standard (non-mts) listener architecture is far from being highly performant
  • Use connection pooling (re-using connections + query queue) if you can
  • If you need Oracle to handle higher connection rate then deploy multiple listeners and set-up clients to load-balance between them
  • Restarting listener won't affect already established tcp/ip connection to a database, so for example all other applications using connection pooling or already long running reports shouldn't be affected by listener restart
  • MTS performance characteristics should be different
  • Choose an operating system with good observability tools

5 comments:

Anonymous said...

One of the best performance diagnosis that I've read for last several months and I read them many :)

However I'm not sure that I understand correctly, does the time of to spawn a child by using fork() is related to RSS size in Solaris? I thought that Solaris (and also Linux) uses copy-on-write. Also do you have any idea why is listener double forking?

Anonymous said...

The fork() performance hit is described in http://developers.sun.com/solaris/articles/subprocess/subprocess.html

When listener is leaking memory, it's likely it creates many small segments and doesn't free them ( thus "leak"). When doing a normal fork() it'll just duplicate all these small segments, and cause a BIG performance issue.

So looks like Oracle should change to vfork or posix_spawn to generate new child processes - it certainly doesn't fix the memory leak problem but will generate a new process faster.

milek said...

Jakub - thanks :) Yes, as Sean explained lot of memory allocations will impact fork performance. When it comes to double-forking - I guess it's done that way to avoid defunct processes.

The problem with standard listener, at least in Oracle 9, is that it is very inneficient and no matter what it won't be able to sustain a high connection ratio.

Anonymous said...

Any chance of using an interposition library here for the fork() to change it to vfork() or posix_spawn()?

Anonymous said...

Hi Robert,

Good Stuff.

I stumbled upon your post as I was researching for my issue on Solaris 5.6 with 8.0.5 database where the client is intermittently being dropped.

Very nice analysis.


- Karl Arao