Recently we migrated an Oracle 9 database to a new cluster to address performance issues and provide some spare capacity. The new servers have more CPU power, more memory and faster storage. After the migration everything seemed fine: the heaviest SQL queries were completing in much shorter times, there were still plenty of spare CPU cycles available, and thanks to the additional memory the database was doing far fewer I/Os.
Nevertheless we started to get complaints that performance was not that good from time to time. A quick investigation showed no CPU usage spikes, no I/O spikes, no network issues... but when I tried to connect to the database using telnet from a client, from time to time it hung for a couple of seconds, sometimes much longer, before it got into the connected state. Well, that suggested it was a network problem after all. I checked whether other network traffic was passing properly while my telnet was hanging, and it was, so it did not look like an issue with already established connections.
The most common case where existing traffic is fine but establishing new TCP connections is problematic is a saturated listen backlog queue. On Solaris there are two kernel tunables responsible for the listen backlog queue (both can be inspected with ndd, as shown below):
tcp_conn_req_max_q
    Specifies the default maximum number of pending TCP connections for a
    TCP listener waiting to be accepted by accept(3SOCKET).
tcp_conn_req_max_q0
    Specifies the default maximum number of incomplete (three-way handshake
    not yet finished) pending TCP connections for a TCP listener.
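For example, to check what limits a system is currently running with:

ndd -get /dev/tcp tcp_conn_req_max_q
ndd -get /dev/tcp tcp_conn_req_max_q0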
The problem is that if the server drops your packets because one of these queues has overflowed, the client will retransmit its SYN several times, increasing the delay between retransmissions each time. So in our case it could take a long time to establish a connection, but once established the database was responding really fast.
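You can actually watch this happening from the client side with snoop pointed at the listener port; the interface name below is just an example, and 1521 is the listener port from our setup:

snoop -d bge0 -ta port 1521

Repeated SYNs towards the database with growing gaps between them and no SYN-ACK coming back match the delays we were seeing in telnet.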
In order to check whether one of these queues had saturated and the system had started dropping packets, I used 'netstat -sP tcp':
# netstat -sP tcp 1 | grep ListenDrop
tcpListenDrop =785829 tcpListenDropQ0 = 0
tcpListenDrop = 27 tcpListenDropQ0 = 0
tcpListenDrop = 19 tcpListenDropQ0 = 0
^C
So some packets had been dropped due to the listen queue overflowing (tcpListenDrop), while the Q0 queue (tcpListenDropQ0) had not been saturated at all since the last reboot.
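As a side note, instead of polling netstat you can catch the drops as they happen with DTrace's mib provider; a small sketch, assuming the provider exposes these two counters (it should on Solaris 10):

dtrace -n 'mib:::tcpListenDrop { @full = count(); } mib:::tcpListenDropQ0 { @q0 = count(); }'

Let it run and press Ctrl-C; the aggregations show which of the two queues has been overflowing in the meantime.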
The default limit for tcp_conn_req_max_q on Solaris 10 is 128. I increased it to 1024 by issuing 'ndd -set /dev/tcp tcp_conn_req_max_q 1024' and then restarting the Oracle listener. Unfortunately it did not fix the issue and the system was still dropping packets. I didn't think the queue limit itself was too small, as there were not that many drops occurring. But I had only raised the system limit - an application can request a different value for its backlog as long as it doesn't exceed that limit. So probably Oracle's listener was setting some relatively low backlog value in its listen(3SOCKET) call. I checked it with dtrace:
dtrace -n 'syscall::listen:entry {trace(execname); trace(pid); trace(arg0); trace(arg1);}'
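The one-liner above traces every listen() on the box; filtering on the listener's process name (tnslsnr by default) keeps the output down to just what we care about - a minimal variant:

dtrace -qn 'syscall::listen:entry /execname == "tnslsnr"/
    { printf("%s pid=%d fd=%d backlog=%d\n", execname, pid, arg0, arg1); }'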
Then I restarted the listener. It turned out that when calling listen() the listener requests a backlog queue size of 5. A quick Google search later I found that there is an Oracle parameter you can put in the listener config file called QUEUESIZE, and its default value is 5. I changed the parameter to 512 (the system limit had already been increased to 1024 a moment ago):
listener.ora
[...]
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = XXXXX)(PORT = 1521)(QUEUESIZE = 512))
[...]
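Restarting the listener makes it re-read listener.ora, and with the dtrace probe from above still running in another terminal you can immediately see what backlog it asks for. Roughly (LISTENER is the default listener name, adjust if yours differs):

lsnrctl stop LISTENER
lsnrctl start LISTENER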
After the restart I confirmed with dtrace that the listener did request a backlog of 512. That was over two weeks ago; since then there hasn't been a single drop, not to mention that no one has reported any performance problems with the database.
Problem fixed.