Robert Milkowski's blog: October 2008

Friday, October 31, 2008

Solaris 10 10/08

Solaris 10 10/08 aka Update 6 is finally out - you can download it from here.
Among the many new features and updates, here are a couple of my favorites:

LSI Mega SAS driver (Dell PERC 5/E, 5/I, 6E, 6I among others)
Root filesystem on ZFS
ZFS send enhancements (recursive send, cumulative incrementals, etc.)
ZFS failmode property - wait, continue or panic in case of catastrophic pool failure
zpool history -l pool - additionally provides user name, hostname and zone name
ZFS slogs
ZFS Hot-Plug support - hands-free disk auto-replacement
ZFS GZIP compression

There are many other enhancements; see the URLs below for more information on them.

What's New by update
New Features in U6
What's New

Thursday, October 30, 2008

Oracle, Listener, TCP/IP and Performance

We recently migrated an Oracle 9 database to a new cluster to address performance issues and provide some spare capacity. The new servers have more CPU power, more memory and faster storage. After the migration everything seemed fine - the heaviest SQL queries were completing in a much shorter time, there were still plenty of spare CPU cycles available and, thanks to the extra memory, the database was doing much less I/O.

Nevertheless, we started to get complaints that from time to time performance was not that good. A quick investigation showed no CPU usage spikes, no I/O spikes and no obvious network issues... but when I tried to connect to the database using telnet from a client, from time to time it hung for a couple of seconds, sometimes much longer, before it got into the connected state. Well, that suggested it was a network problem after all. I checked whether other network traffic was passing properly while my telnet was hanging - and it was. So it did not seem to be an issue with already established connections.

The most common case where existing network traffic is fine but establishing new TCP connections is a problem is a saturated listen backlog queue. On Solaris there are two kernel tunables responsible for the listen backlog queue:


tcp_conn_req_max_q
  Specifies the default maximum number of pending TCP connections
  for a TCP listener waiting to be accepted by accept(3SOCKET).

tcp_conn_req_max_q0
  Specifies the default maximum number of incomplete (three-way
  handshake not yet finished) pending TCP connections for a TCP listener.

The problem is that if a server drops your packets due to an overflow of one of the above queues, the client will retransmit several times, each time increasing the delay between retransmissions. So in our case it could take a long time to establish a connection, but once it was established the database was responding really fast.
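To get a feel for how quickly those retransmission delays add up, here is a minimal sketch. It assumes the client doubles its retransmission timeout after every dropped SYN and starts from a 3-second timeout; both are illustrative assumptions, since the real values depend on the client's TCP stack and its tunables. The function name syn_wait_total is made up for this example.

```shell
#!/bin/sh
# Total time a client waits if its first N SYN packets are all dropped,
# assuming the retransmission timeout doubles after every attempt.
syn_wait_total() {
    rto=$1      # initial retransmission timeout in seconds (assumed value)
    drops=$2    # number of consecutive SYNs dropped by the server
    total=0
    i=0
    while [ "$i" -lt "$drops" ]; do
        total=$((total + rto))   # wait out the current timeout
        rto=$((rto * 2))         # then back off exponentially
        i=$((i + 1))
    done
    echo "$total"
}

# Illustrative run with an assumed 3-second initial timeout.
for n in 1 2 3 4; do
    echo "$n dropped SYN(s): connect delayed by $(syn_wait_total 3 "$n")s"
done
```

Under these assumptions a single dropped SYN costs a few seconds, which matches the "couple of seconds, sometimes much longer" hangs seen with telnet.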

To check whether one of these queues had saturated and the system had started dropping packets I used 'netstat -sP tcp':


# netstat -sP tcp 1 | grep ListenDrop
    tcpListenDrop       =785829     tcpListenDropQ0     =     0
    tcpListenDrop       =    27     tcpListenDropQ0     =     0
    tcpListenDrop       =    19     tcpListenDropQ0     =     0
^C

So connections were being dropped because the tcp_conn_req_max_q queue was full (tcpListenDrop), while the tcp_conn_req_max_q0 queue (tcpListenDropQ0) had not been saturated at all since the last reboot.
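If you want to leave this running and only see the interesting samples, the output can be filtered with awk. This is a rough sketch, assuming the line layout shown above and assuming (as the numbers suggest) that the first sample is the total since boot while later samples are per-interval counts. The function name listen_drops is made up for this example.

```shell
#!/bin/sh
# Read "netstat -sP tcp <interval>" output on stdin. Print the first
# (cumulative) sample, then only those later samples that contain drops.
listen_drops() {
    awk -F= '/tcpListenDrop/ {
        n++
        q  = $2 + 0    # tcpListenDrop: leading number of the 2nd field
        q0 = $3 + 0    # tcpListenDropQ0: number in the 3rd field
        if (n == 1) { printf "since boot: q=%d q0=%d\n", q, q0; next }
        if (q > 0 || q0 > 0) printf "sample %d: q=%d q0=%d\n", n, q, q0
    }'
}

# Demo on the lines captured above; on a live system you would run
#   netstat -sP tcp 1 | listen_drops
listen_drops <<'EOF'
    tcpListenDrop       =785829     tcpListenDropQ0     =     0
    tcpListenDrop       =    27     tcpListenDropQ0     =     0
    tcpListenDrop       =    19     tcpListenDropQ0     =     0
EOF
```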

The default limit for tcp_conn_req_max_q on Solaris 10 is 128. I increased it to 1024 by issuing 'ndd -set /dev/tcp tcp_conn_req_max_q 1024' and then restarting the Oracle listener. Unfortunately it did not fix the issue and the system was still dropping packets. I didn't think the queue limit was too small, as there were not that many drops occurring. But I had increased only the system limit - an application can request a different backlog value as long as it doesn't exceed that limit. So the Oracle listener was probably setting some relatively low backlog value in its listen(3SOCKET) call. I checked it with dtrace:


dtrace -n 'syscall::listen:entry { trace(execname); trace(pid); trace(arg0); trace(arg1); }'

and then restarted the listener. It turned out that the listener was calling listen() with a requested backlog (arg1) of 5. A quick Google search showed that there is an Oracle parameter called QUEUESIZE that you can put in the listener config file, and that its default value is 5. I changed the parameter to 512 (the system limit had already been increased to 1024 a moment earlier):


listener.ora
[...]
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = XXXXX)(PORT = 1521)(QUEUESIZE = 512))
[...]

Then I restarted the listener and confirmed with dtrace that it requested a backlog size of 512. That was over two weeks ago and since then there hasn't been a single drop, and no one has reported any performance problems with the database.
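The way the two limits interact explains why raising only the system limit changed nothing. A minimal sketch, assuming the kernel simply caps the application's requested backlog at tcp_conn_req_max_q (the function name effective_backlog is made up for this example):

```shell
#!/bin/sh
# Backlog that ends up in effect: the value the application passes to
# listen(), capped by the system-wide limit (assumed clamping rule).
effective_backlog() {
    requested=$1   # backlog argument of listen(), e.g. Oracle's QUEUESIZE
    limit=$2       # tcp_conn_req_max_q
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

effective_backlog 5 128      # listener default, system default   -> 5
effective_backlog 5 1024     # only the system limit raised       -> 5
effective_backlog 512 1024   # QUEUESIZE = 512 as well            -> 512
```

In other words, both the ndd setting and QUEUESIZE have to be raised before the listener actually gets a longer queue.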

Problem fixed.

Saturday, October 25, 2008

ZFS as root-fs in S10U6

Tuesday, October 21, 2008

Intel SSD Drives + ZFS

It's marketing; nevertheless, it highlights the benefits of ZFS's L2ARC and SLOGs.

update: presentation by Bill Moore on performance of SSD+ZFS

T6340 Blade

Sun has announced the new T6340 blade. I've been playing with and deploying systems based on UltraSPARC T2+ for several months now and I can say it's a great platform for consolidation or migration from older HW. We've migrated from older v480 and v440 servers to a cluster of T5220 servers utilizing branded zones (Solaris 8 and Solaris 9 zones). The T6340 is basically a T5240 in blade form. All I can say is: it just works. The performance is excellent.

What a lot of people do not realize is that one can easily consolidate several e4500s, e6500s or any other old SPARC servers onto Niagara-based servers. Not only will the performance be better, but you will also save on support and operational costs. Additionally, you can use branded zones to migrate your Solaris 8 or Solaris 9 environment into a zone on Solaris 10 if you can't justify deploying your applications directly on Solaris 10. You can always create another, native Solaris 10, zone later on and migrate your application between zones on the same hardware.

Friday, October 17, 2008

Open Solaris on IBM System Z

The Sirius Project Page has interesting Release Notes which explain Sun's motivation for porting Open Solaris to System Z and provide a lot of technical detail on the OS implementation. You will also find there a very interesting history of the Solaris port to the x86 platform.

Rock at Hot Chips

stmsboot update

Are you enabling MPxIO manually because you don't like stmsboot? Are you disappointed when -L doesn't work? Have you ever wondered why the heck we need two reboots? Well, it seems these issues have been addressed and the fixes integrated into build 99.

Distribution Constructor

DC Project Page:

"The Distribution Constructor (DC) is a command line tool for building custom Open Solaris images. At this time, the tool takes a manifest file as input, and outputs an ISO image based on parameters specified in the manifest. Optionally, an USB image can be created based on the generated ISO image. Other output image formats are possible, but it is not produced by the DC team at this time."

Tuesday, October 14, 2008

xVM Server

If you like a black-box approach to virtualization then see this presentation.

Of course there is Open Solaris underneath but you don't need to know about it.
Now I'm trying to get early access to xVM Server :)

Time Slider

Monday, October 13, 2008

T5440

4 chips, 8 cores per chip, 8 strands per core = 256 virtual CPUs, plus 32 built-in HW crypto accelerators, up to 512GB RAM, 8 PCI slots, 4 on-board GbE ports, 4 internal disk drives, all in 4U.

Allan Packer has posted more info on the beast.

If you consolidate applications which can easily take advantage of many threads you can get really good throughput for reasonable money - in some cases you will get better performance than 3x IBM p570 and it will cost you MUCH less.

An interesting application of the T5440 would be a 2-node cluster utilizing Solaris 8, Solaris 9 and native Solaris 10 zones - that allows quick and easy consolidation and migration of old environments. I did something similar in the past, but using T5220 servers.

T5440 PCI-E I/O Performance
T5440 Architecture Overview
T5440 Photos

Tuesday, October 07, 2008

Niagara2+ ssh improvements

Recently ssh/openssl changes have been integrated so that, when running on Niagara-2 CPUs, ssh will automatically take advantage of the built-in HW crypto. So let's see what improvement we can expect.

Below is a chart of the total time it took to transfer about 500MB over ssh using different ciphers.

As you can see there is a very nice improvement and it's out of the box - no special configs, no tuning - it just works.

Details below.

Solaris 10 5/08


# time dd if=/dev/zero bs=1024k count=500 | ssh root@localhost 'cat >/dev/null'

real    0m52.635s
user    0m48.256s
sys     0m4.195s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c blowfish root@localhost 'cat >/dev/null'

real    0m39.705s
user    0m34.744s
sys     0m3.884s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c arcfour root@localhost 'cat >/dev/null'

real    0m34.551s
user    0m29.169s
sys     0m4.273s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-cbc root@localhost 'cat >/dev/null'

real    1m5.420s
user    0m54.914s
sys     0m3.963s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-ctr root@localhost 'cat >/dev/null'

real    0m52.227s
user    0m47.970s
sys     0m3.937s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c 3des root@localhost 'cat >/dev/null'

real    2m27.648s
user    2m23.886s
sys     0m4.071s

Open Solaris build 99


# time dd if=/dev/zero bs=1024k count=500 | ssh root@localhost 'cat >/dev/null'

real    0m20.478s
user    0m12.028s
sys     0m6.853s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c blowfish root@localhost 'cat >/dev/null'

real    0m43.196s
user    0m38.136s
sys     0m4.031s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c arcfour root@localhost 'cat >/dev/null'

real    0m20.443s
user    0m11.992s
sys     0m6.923s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-cbc root@localhost 'cat >/dev/null'

real    0m20.500s
user    0m12.008s
sys     0m7.054s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-ctr root@localhost 'cat >/dev/null'

real    0m21.372s
user    0m12.013s
sys     0m7.225s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c 3des root@localhost 'cat >/dev/null'

real    0m21.758s
user    0m12.396s
sys     0m7.513s
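
To turn the raw timings above into speedup factors, here is a small sketch that converts the "real" values to seconds and divides the Solaris 10 5/08 time by the build 99 time for each cipher. The helper names to_seconds and speedup are made up for this example, and "default" stands for the run without a -c option.

```shell
#!/bin/sh
# Convert a time(1) "real" value such as 2m27.648s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Speedup factor: old elapsed time divided by new elapsed time.
speedup() {
    awk -v old="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", old / new }'
}

# cipher, Solaris 10 5/08 real time, Open Solaris build 99 real time
while read -r cipher s10 b99; do
    printf '%-12s %sx\n' "$cipher" "$(speedup "$s10" "$b99")"
done <<'EOF'
default      0m52.635s  0m20.478s
blowfish     0m39.705s  0m43.196s
arcfour      0m34.551s  0m20.443s
aes128-cbc   1m5.420s   0m20.500s
aes128-ctr   0m52.227s  0m21.372s
3des         2m27.648s  0m21.758s
EOF
```

On these numbers 3des comes out close to 6.8 times faster and the default cipher about 2.6 times faster, while blowfish, at roughly 0.9x, does not improve at all.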