Thursday, December 04, 2008

Virtualization and Clustering

Virtualization is really good for consolidating old servers onto one bigger server. The problem, however, is that if the big server dies for whatever reason, instead of having one small problem you end up with 10 or more small problems, and ten small problems in a production environment usually add up to one big problem.

So the missing component is clustering - if you could somehow make a cluster and virtualization work together... well, you have been able to do that with Solaris Zones for quite some time and it works really well. But what about Xen (aka xVM Server) based virtualization? There is a project called ha-xvm which aims to provide a Solaris Cluster agent to do exactly that. See a demo showing that it actually works :) And yes, it does include live migration when switching a virtual server to another node.

Contributing Open Solaris Packages

Tuesday, December 02, 2008

Thursday, November 27, 2008

OS Code Swarms - cool!






OS performance diagnostics for Oracle

I found a presentation on OS performance diagnostics for Oracle. It's a good start if you haven't done much performance tuning yourself. It will help you familiarize yourself with different utilities and point you to some basic parameters you should be aware of. Most of the presentation is valid for any application running on a system, not only Oracle.

Friday, November 14, 2008

Oracle's Listener Performance

Recently I posted a blog entry on TCP ListenDrops and Listener. Since the last tuning we haven't experienced a single ListenDrop - good. Nevertheless we started experiencing some performance issues with the same Oracle instance. Although there was plenty of CPU to spare, no increase in disk I/O and generally nothing wrong with the system at first glance, connecting to Oracle could take as long as 4 seconds. Once connected, the database was responding really fast. Because the connection time used to be well below 1s, the impact was relatively huge - especially when you have programs doing a couple of connections and SQL queries to the database while a customer is waiting for a page to be built. With almost 4s extra for each database connection you end up with an extra 4-12s of delay for a user. So what happened?

First I wanted to check whether it was a database problem after all. There was a CGI script I knew was producing its results much more slowly than it used to. I was told it usually produced the report within 5s and now it took 8-15s. A quick look around a middleware box and at its historical data (CPU, run queue, network, etc.) didn't reveal anything obvious. So I slightly modified the shortlived.d script from the DTraceToolkit to print the total CPU time and the real time it took to execute the CGI script in production. The real times were oscillating between 8-16s while vtime (CPU time) was always within 2s. That means the script is definitely not CPU bound. A quick look at the script itself showed that it doesn't actually do much - basically two database connections, and it prints parsed output. This strongly suggests that it spends most of its time waiting for the database, and that's where I should focus.
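
The real-time versus CPU-time comparison can be reproduced in miniature. The sketch below is not the modified shortlived.d script; it is a small Python illustration in which a sleep stands in (as an assumption) for time spent blocked on a database connection. The function names are made up for the example.

```python
import time

def measure(work):
    """Return (real_seconds, cpu_seconds) consumed while running work()."""
    real_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return (time.perf_counter() - real_start,
            time.process_time() - cpu_start)

def wait_on_database():
    # Hypothetical stand-in: sleeping models blocking on a slow connection.
    time.sleep(0.2)

real, cpu = measure(wait_on_database)
print(f"real={real:.3f}s cpu={cpu:.3f}s")
```

When real time is much larger than CPU time, as here, the process is waiting on something rather than computing.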

I tried to connect to the database using the sqlplus client and it took a couple of seconds before I even got a prompt. My first thought was that it was the ListenDrop issue I blogged about recently - it wasn't. So I checked whether the problem was still there when connecting to the listener locally from the db host itself instead of going over the network, to eliminate everything in between. The issue was still there, so it definitely wasn't a network issue (at least not per se) but rather something local to the db box.

I decided to use the truss utility on the listener to see why it takes so much time to establish a session. It is easy to filter out my connection, as nothing else connects locally to the database, so I only need to look for an accept() related to the IP address of the db server. So I ran the sqlplus client and truss more or less at the same time and repeated this several times. It quickly turned out that it takes over 3s on average before the listener even does an accept() for my connection:


# truss -o /tmp/milek1 -d -f -v all -ED -p 15697
15697: 3.9228 0.0001 0.0001 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 35705


This means that the problem is somewhat related to the ListenDrop issue: a new connection sits in the backlog for almost 4s before the listener picks it up - it just doesn't call accept() quickly enough to drain the queue in real time. So either we are generating more connections per second to the database or for some reason the listener is slower. Because it was already late and we couldn't find any recent changes which would cause more connections to the database, I decided to look at what the listener actually does and how much time it takes.

After a quick analysis of the truss output for several connections to the listener, and knowing that the listener is a single-threaded process, the overall picture of what it does is:

[...]
pollsys()
accept()
read() - over a network descriptor!
pipe()
pipe()
fork1()
fork1()
execve() - exec() given oracle process
_exit()
waitid() - wait for child to exit
read() - from pipe
write() - to pipe
read() - from pipe
read() - from pipe
close() - close pipes and network fd
pollsys()
[...]


And it does this in a loop. Let's look at the truss output for the handling of one such connection:

# truss -o /tmp/milek6 -d -f -v all -ED -p 15697
# grep ^15697 /tmp/milek6
[...]
15697: 1.5426 1.0901 0.0000 pollsys(0x10041CDA8, 2, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.5440 0.0014 0.0000 getsockname(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.5442 0.0002 0.0000 getpeername(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) Err#134 ENOTCONN
15697: 1.5463 0.0021 0.0020 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 41612
15697: 1.5464 0.0001 0.0000 getsockname(13, 0xFFFFFFFF7FFF888C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.5465 0.0001 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF88AC) = 0
15697: 1.5466 0.0001 0.0000 setsockopt(13, tcp, TCP_NODELAY, 0xFFFFFFFF7FFF8A14, 4, SOV_DEFAULT) = 0
15697: 1.5468 0.0002 0.0000 fcntl(13, F_SETFD, 0x00000001) = 0
15697: 1.5472 0.0004 0.0001 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnk59.so", F_OK) = 0
15697: 1.5473 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libngss9.so", F_OK) Err#2 ENOENT
15697: 1.5474 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnnts9.so", F_OK) Err#2 ENOENT
15697: 1.5475 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnrad9.so", F_OK) = 0
15697: 1.5476 0.0001 0.0000 sigaction(SIGPIPE, 0xFFFFFFFF7FFF8B80, 0xFFFFFFFF7FFF8CA8) = 0
15697: new: hand = 0x00000001 mask = 0x9FBFF057 0x0000FFF7 0 0 flags = 0x000C
15697: old: hand = 0x00000001 mask = 0 0 0 0 flags = 0x0000
15697: 1.5478 0.0002 0.0000 pollsys(0x10041CDA8, 3, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=0
15697: fd=13 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.5489 0.0011 0.0000 read(13, "\0F0\0\001\0\0\001 801 ,".., 8208) = 240
15697: 1.5494 0.0005 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF9E7C) = 0
15697: 1.5497 0.0003 0.0000 fcntl(13, F_SETFD, 0x00000000) = 0
15697: 1.5497 0.0000 0.0000 pipe() = 14 [15]
15697: 1.5498 0.0001 0.0000 pipe() = 16 [17]
15697: 1.5798 0.0300 0.0299 fork1() = 8195
15697: 1.5853 0.0055 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
15697: 1.6395 0.0542 0.0000 waitid(P_PID, 8195, 0xFFFFFFFF7FFF6D10, WEXITED|WTRAPPED) = 0
15697: siginfo: SIGCLD CLD_EXITED pid=8195 status=0x0000
15697: 1.6396 0.0001 0.0000 close(14) = 0
15697: 1.6397 0.0001 0.0000 close(17) = 0
15697: 1.7063 0.0666 0.0000 read(16, " N T P 0 8 1 9 7\n", 64) = 10
15697: 1.7072 0.0009 0.0000 getpid() = 15697 [1]
15697: 1.7073 0.0001 0.0000 fcntl(16, F_SETFD, 0x00000001) = 0
15697: 1.7076 0.0003 0.0000 write(15, "\0\0\0 =", 4) = 4
15697: 1.7078 0.0002 0.0000 write(15, " ( A D D R E S S = ( P R".., 61) = 61
15697: 1.7079 0.0001 0.0000 write(15, "\0\00401", 4) = 4
15697: 1.8377 0.1298 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8397 0.0020 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8397 0.0000 0.0000 close(15) = 0
15697: 1.8398 0.0001 0.0000 close(16) = 0
15697: 1.8400 0.0002 0.0000 close(13) = 0
15697: 1.8424 0.0024 0.0000 lseek(3, 0, SEEK_CUR) = 0x06919427
15697: 1.8425 0.0001 0.0001 write(3, " 0 7 - N O V - 2 0 0 8 ".., 220) = 220
15697: 2.6446 0.8021 0.0000 pollsys(0x10041CDA8, 2, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
[...]


Now let's look closely at the timing. It took almost 0.3s (1.84-1.54 = 0.30) from accept() to finish handling that connection and move on to the next accept(). Assuming that is the average time to handle one connection, the listener would be able to process only about 3 connections per second! That's not much.

Look at the largest deltas in the above truss output (originally marked in red): fork1(), waitid() and the first two reads from the pipe. Adding them up gives 0.0300+0.0542+0.0666+0.1298 = 0.2806s. Remember that it took about 0.3s for the entire loop to complete. So the period from the moment the listener calls fork1() to the moment it reads from the child's pipe for the second time accounts for more than 90% of the entire real time spent handling a request.
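
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the numbers are copied by hand from the truss output, and the variable names are made up for the example.

```python
# Timestamps (seconds) copied from the listener's truss output above.
accept_done = 1.5463   # accept() returned
conn_closed = 1.8400   # close(13) on the client socket

loop_time = conn_closed - accept_done
rate = 1.0 / loop_time

# Deltas reported for fork1(), waitid() and the first two pipe reads.
child_wait = [0.0300, 0.0542, 0.0666, 0.1298]
waiting = sum(child_wait)

print(f"time per connection: {loop_time:.4f}s")
print(f"max connections per second: {rate:.1f}")
print(f"waiting on the child: {waiting:.4f}s ({waiting / loop_time:.0%})")
```

Since the listener is single threaded, one over the per-connection time is an upper bound on the connection rate it can sustain.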

When the listener does a fork1(), its child will basically do another fork1() and exit:

# grep ^8195 /tmp/milek6
8195: 1.5798 1.5798 0.0000 fork1() (returning as child ...) = 15697
8195: 1.5855 0.0057 0.0000 getpid() = 8195 [15697]
8195: 1.5863 0.0008 0.0000 lwp_self() = 1
8195: 1.5864 0.0001 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
8195: 1.5868 0.0004 0.0000 schedctl() = 0xFFFFFFFF7F78A000
8195: 1.6115 0.0247 0.0246 fork1() = 8197
8195: 1.6159 0.0044 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
8195: 1.6163 0.0004 0.0000 _exit(0)


Let's see what happens in a 2nd child then:


# grep ^8197 /tmp/milek6
8197: 1.6115 1.6115 0.0000 fork1() (returning as child ...) = 8195
[...]
8197: 1.6433 0.0139 0.0137 execve("/dslrp/ua01/app/oracle/product/9.2.0/bin/oracle", 0x1002A74B0, 0x1112E95E0) argc = 2
[...]

8197: 1.7062 0.0000 0.0000 write(17, " N T P 0 8 1 9 7\n", 10) = 10
[exec has already completed, a new process is up and it is signaling to the listener that it is up]
[it writes to pipe, it is the first read from the pipe in the listener]


[...]
8197: 1.7153 0.0001 0.0000 cladm(CL_INITIALIZE, CL_GET_BOOTFLAG, 0xFFFFFFFF7D1044FC) = 0
8197: bootflags=CLUSTER_CONFIGURED|CLUSTER_BOOTED
8197: 1.7164 0.0011 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: 1.7164 0.0000 0.0000 cladm(CL_CONFIG, CL_HIGHEST_NODEID, 0xFFFFFFFF7D104508) = 0
8197: nodeid=64
8197: 1.7165 0.0001 0.0000 cladm(CL_CONFIG, CL_NODEID, 0xFFFFFFFF7D104504) = 0
8197: nodeid=2
8197: 1.7176 0.0011 0.0010 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7186 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.7196 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4EE8) = 0
8197: 1.7206 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E88) = 0
8197: 1.7219 0.0013 0.0012 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7229 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.7238 0.0009 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4EE8) = 0
8197: 1.7256 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7273 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7291 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7308 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7325 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7343 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7360 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7377 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7395 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7412 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7429 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7447 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7465 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7482 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7499 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7517 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7534 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7552 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7570 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7587 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7605 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7622 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7639 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7657 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7674 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7691 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7709 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7726 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7743 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7760 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7778 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7795 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7812 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7830 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7847 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7864 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7882 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7899 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7917 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7934 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7951 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7968 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7986 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8003 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8020 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8037 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8054 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8072 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8089 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8106 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8123 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8140 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8158 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8175 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8192 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8209 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8226 0.0017 0.0016 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8244 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8262 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8279 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8296 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8313 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8322 0.0009 0.0008 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: 1.8324 0.0002 0.0001 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF5168) = 0
[...]
8197: 1.8377 0.0001 0.0000 write(17, "\0\0\0\0", 4) = 4
[it is a second time it writes to pipe and that's the 2nd read from the pipe in listener]

[...]
8197: 1.8397 0.0002 0.0000 write(17, "\0\0\0\0", 4) = 4
[3rd time it writes to pipe and after the 3rd read from pipe listener cleans up and moves to another connection]

[...]


It calls cladm() 75 times and the total time it takes just to handle these 75 calls is about 0.11s - that's quite a lot compared to the 0.28s the listener spends waiting for a child. If I could get rid of these cladm() calls, that wait would drop by roughly 40% and the listener could handle a noticeably higher connection rate.
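
Counting calls and adding up their times by hand gets tedious, so here is a minimal sketch of how a saved truss file could be summarized per syscall. It assumes the line layout shown above (PID, timestamp, two more time columns, then the call) and sums the second numeric column, which is the one the arithmetic in this post uses. The SAMPLE text is just a few lines copied from the output; the function and variable names are made up.

```python
import re
from collections import defaultdict

# A few lines in the format shown above (copied from the child's output).
SAMPLE = r"""
8197: 1.7164 0.0011 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: nodeid=2
8197: 1.7176 0.0011 0.0010 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7186 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.8377 0.0001 0.0000 write(17, "\0\0\0\0", 4) = 4
"""

# pid, timestamp, delta, third time column, syscall name
LINE = re.compile(r"^(\d+):\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\w+)\(")

def summarize(text):
    """Return {syscall: (count, total_delta_seconds)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip blank and continuation lines such as "nodeid=2"
        name, delta = m.group(5), float(m.group(3))
        totals[name][0] += 1
        totals[name][1] += delta
    return {k: (c, round(t, 4)) for k, (c, t) in totals.items()}

for name, (count, total) in summarize(SAMPLE).items():
    print(f"{name}: {count} calls, {total:.4f}s")
```

Fed the whole truss file for the child instead of SAMPLE, the same function would give the cladm() count and total directly.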
Let's see what exactly is calling cladm():

# dtrace -n syscall::cladm:entry'{@[ustack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
[...]
libc.so.1`_cladm+0x4
nss_cluster.so.1`0xffffffff7d003364
nss_cluster.so.1`0xffffffff7d00219c
nss_cluster.so.1`0xffffffff7d0026e8
nss_cluster.so.1`0xffffffff7d002d1c
nss_cluster.so.1`0xffffffff7d0019b8
libc.so.1`nss_search+0x288
libnsl.so.1`_switch_gethostbyname_r+0x70
libnsl.so.1`_get_hostserv_inetnetdir_byname+0x958
libnsl.so.1`gethostbyname_r+0xb8
oracle`nttmyip+0x2c
oracle`nttcon+0x698
oracle`ntconn+0xd4
oracle`nsopen+0x840
oracle`nsgetinfo+0xb4
oracle`nsinh_hoffable+0x38
oracle`nsinh_hoff+0xe20
oracle`nsinherit+0x204
oracle`niotns+0x44c
oracle`osncon+0x3a0
64



A quick look at SunSolve and I found Document ID: 216260.

Let's confirm that we have the cluster keyword in the /etc/nsswitch.conf file:


# grep ^hosts /etc/nsswitch.conf
hosts: cluster files [SUCCESS=return] dns


I removed the cluster keyword from the file and checked with truss again to see if it made any difference.


# truss -o /tmp/milek4 -d -f -v all -ED -p 15697
# grep ^15697 /tmp/milek4
[...]
15697: 1.6867 0.0020 0.0020 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 41573
15697: 1.6869 0.0002 0.0000 getsockname(13, 0xFFFFFFFF7FFF888C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.6870 0.0001 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF88AC) = 0
15697: 1.6871 0.0001 0.0000 setsockopt(13, tcp, TCP_NODELAY, 0xFFFFFFFF7FFF8A14, 4, SOV_DEFAULT) = 0
15697: 1.6881 0.0010 0.0000 fcntl(13, F_SETFD, 0x00000001) = 0
15697: 1.6885 0.0004 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnk59.so", F_OK) = 0
15697: 1.6887 0.0002 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libngss9.so", F_OK) Err#2 ENOENT
15697: 1.6888 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnnts9.so", F_OK) Err#2 ENOENT
15697: 1.6890 0.0002 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnrad9.so", F_OK) = 0
15697: 1.6891 0.0001 0.0000 sigaction(SIGPIPE, 0xFFFFFFFF7FFF8B80, 0xFFFFFFFF7FFF8CA8) = 0
15697: new: hand = 0x00000001 mask = 0x9FBFF057 0x0000FFF7 0 0 flags = 0x000C
15697: old: hand = 0x00000001 mask = 0 0 0 0 flags = 0x0000
15697: 1.6893 0.0002 0.0000 pollsys(0x10041CDA8, 3, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=0
15697: fd=13 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.6906 0.0013 0.0000 read(13, "\0F0\0\001\0\0\001 801 ,".., 8208) = 240
15697: 1.6911 0.0005 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF9E7C) = 0
15697: 1.6914 0.0003 0.0000 fcntl(13, F_SETFD, 0x00000000) = 0
15697: 1.6915 0.0001 0.0000 pipe() = 14 [15]
15697: 1.6918 0.0003 0.0000 pipe() = 16 [17]
15697: 1.7276 0.0358 0.0357 fork1() = 1214
15697: 1.7327 0.0051 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
15697: 1.7874 0.0547 0.0000 waitid(P_PID, 1214, 0xFFFFFFFF7FFF6D10, WEXITED|WTRAPPED) = 0
15697: siginfo: SIGCLD CLD_EXITED pid=1214 status=0x0000
15697: 1.7877 0.0003 0.0000 close(14) = 0
15697: 1.7878 0.0001 0.0000 close(17) = 0
15697: 1.8755 0.0877 0.0000 read(16, " N T P 0 1 2 1 6\n", 64) = 10
15697: 1.8771 0.0016 0.0000 getpid() = 15697 [1]
15697: 1.8772 0.0001 0.0000 fcntl(16, F_SETFD, 0x00000001) = 0
15697: 1.8777 0.0005 0.0000 write(15, "\0\0\0 =", 4) = 4
15697: 1.8778 0.0001 0.0000 write(15, " ( A D D R E S S = ( P R".., 61) = 61
15697: 1.8779 0.0001 0.0000 write(15, "\0\00401", 4) = 4
15697: 1.8829 0.0050 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8848 0.0019 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8851 0.0003 0.0000 close(15) = 0
15697: 1.8852 0.0001 0.0000 close(16) = 0
15697: 1.8854 0.0002 0.0000 close(13) = 0
15697: 1.8878 0.0024 0.0000 lseek(3, 0, SEEK_CUR) = 0x0690820E
15697: 1.8879 0.0001 0.0000 write(3, " 0 7 - N O V - 2 0 0 8 ".., 220) = 220


Now the timings are better, and if you compare the numbers you will see a reduction of roughly 35% in the time the listener waits for a child before it can move on to process another request. It improves the connection rate the listener can sustain from 3.3 per second to about 5 per second.
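
The before/after comparison can be worked through in a few lines. Again this is just a sketch with values copied by hand from the two truss outputs; the variable names are made up.

```python
# Deltas (seconds) for fork1(), waitid() and the first two pipe reads,
# copied from the two truss outputs above.
before = [0.0300, 0.0542, 0.0666, 0.1298]   # with "cluster" in nsswitch.conf
after = [0.0358, 0.0547, 0.0877, 0.0050]    # after removing it

# accept() return to close() of the client socket, from the same outputs.
loop_before = 1.8400 - 1.5463
loop_after = 1.8854 - 1.6867

reduction = 1 - sum(after) / sum(before)
print(f"wait for child: {sum(before):.4f}s -> {sum(after):.4f}s "
      f"({reduction:.0%} less)")
print(f"connections/s: {1 / loop_before:.1f} -> {1 / loop_after:.1f}")
```

Almost all of the saving comes from the second read from the pipe, which is where the child used to be busy calling cladm().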

Why cladm() is being called 75 times each time an Oracle process starts - I don't know; at first glance it looks like a bug in the nss_cluster.so.1 library. It doesn't matter for now, as removing the keyword is a good workaround. Nevertheless it will have to be raised with Sun's support.

The other issue is that the system hasn't been patched recently and the nsswitch.conf file hasn't been changed. So while it is good to see some performance improvement, there must be something else which caused the listener to process new connections more slowly. I manually confirmed that the change made a difference - now the sqlplus client connects in less than 2s and the CGI script executes in a shorter time. Putting the cluster keyword back made the numbers worse again. Removing it improved them again. Good.

If you look closely at the timings again you will notice that, with cladm() out of the picture, the listener now spends the majority of its time waiting for the two fork1()'s and the execve() to complete. Now why would they be slower than before? I was almost certain it was not due to CPU, but it could be due to memory, and it had to be something related only to the listener, because once an oracle process is up it processes requests as fast as usual.

I checked how big the listener process was and it was over 400MB in size (RSS). We quickly set up another listener on the box, listening on a different TCP port, and it turned out to be only about 23MB in size. We tried to connect to the database using the new listener and it took much less than a second. We switched some traffic over to the new listener and it was able to process more requests per second. We could also verify this with a simple dtrace script which measures the number of accept()'s per second broken down by PID.


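# cat accept_rate.d
#!/usr/sbin/dtrace -qs

/* Set the aggregation sort options once, at startup. */
dtrace:::BEGIN
{
        setopt("aggsortkey");
        setopt("aggsortkeypos", "1");
}

/* Count accept() calls per process name and PID. */
syscall::accept:entry
{
        @[execname, pid] = count();
}

/* Every second: print the counts, then reset them. */
tick-1s
{
        printf("%Y\n", walltimestamp);
        printa("process: %-20s [%-5d] %@5d\n", @);
        clear(@);
        printf("\n");
}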


We left the 2nd listener running and reconfigured some clients' tnsnames.ora files so that they load-balanced between the two listeners when establishing new connections to the database. That way we could compare both listeners and further convince ourselves that a restart should help performance.

Knowing that the listener basically loops between pollsys() syscalls while handling new connections, I wrote a simple script showing the distribution of the time it takes the listener to handle a connection.


# cat listener_loop_times.d
#!/usr/sbin/dtrace -qs

/* pollsys() returned: the listener starts handling an event. */
syscall::pollsys:return
/execname == "tnslsnr"/
{
        self->t = timestamp;
        self->vt = vtimestamp;
}

/* Back in pollsys(): record elapsed real and CPU time in 10 ms units. */
syscall::pollsys:entry
/self->t/
{
        @time[execname, pid] = lquantize((timestamp - self->t) / 10000000, 0, 30);
        @vtime[execname, pid] = lquantize((vtimestamp - self->vt) / 10000000, 0, 10);

        self->t = 0;
        self->vt = 0;
}

tick-5s
{
        printf("%Y\n", walltimestamp);
        printa(@time);
        printa(@vtime);
}


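Below you can see sample output, where 25351 is the PID of the new listener and 15697 is the PID of the old listener which grew to over 400MB in size. The first pair of histograms shows real time (@time) and the second pair CPU time (@vtime); values are in units of 10ms.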


18 92927 :tick-5s
tnslsnr 25351
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 162
1 | 0
2 |@@@@@@@@@ 71
3 |@@@@@@@@@@@ 87
4 | 0

tnslsnr 15697
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 206
1 | 0
2 | 0
3 | 0
4 | 0
5 | 0
6 | 0
7 | 0
8 | 1
9 |@ 10
10 |@@@@@@ 67
11 |@@@@@@@@@ 90
12 |@@@@ 41
13 | 3
14 | 1
15 | 0


tnslsnr 25351
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 320
1 | 0

tnslsnr 15697
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 207
1 | 0
2 |@@@@@@@ 76
3 |@@@@@@@@@@@ 118
4 |@@ 17
5 | 0
6 | 1
7 | 0



We restarted the original listener and, as expected, its performance was much better.

It turned out there is an Oracle bug (5576565), a memory leak in the listener which causes a performance problem. Oracle provides Patch 5576565 which should fix the issue.

Some quick conclusions:
  • Standard (non-mts) listener architecture is far from being highly performant
  • Use connection pooling (re-using connections + query queue) if you can
  • If you need Oracle to handle higher connection rate then deploy multiple listeners and set-up clients to load-balance between them
  • Restarting listener won't affect already established tcp/ip connection to a database, so for example all other applications using connection pooling or already long running reports shouldn't be affected by listener restart
  • MTS performance characteristics should be different
  • Choose an operating system with good observability tools
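
To make the connection pooling point concrete, here is a minimal sketch of the idea: open a fixed number of connections once and reuse them, so the listener only pays the fork/exec cost a handful of times. Everything here is hypothetical, including the class and the fake_connect() function, which stands in for a real database driver's connect call.

```python
import queue

class ConnectionPool:
    """Keep a fixed set of open connections and hand them out on demand."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())   # pay the connection cost once, up front

    def acquire(self, timeout=None):
        # Blocks when every connection is in use; this is the "query queue".
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)

# Hypothetical stand-in for a real database driver's connect() call.
opened = []
def fake_connect():
    opened.append(object())
    return opened[-1]

pool = ConnectionPool(fake_connect, size=2)
for _ in range(10):            # ten requests...
    conn = pool.acquire()
    pool.release(conn)
print(len(opened))             # ...but only two connections were opened
```

A production pool would also need to handle broken connections and errors between acquire() and release(), which this sketch leaves out.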

Wednesday, November 12, 2008

Fishworks

When you think about Open Solaris, it has many really unique features which you can't find on other platforms. Then there is a big market of customers who don't have or don't need the skills to use these tools - what they want is an appliance approach, where you put something in your network and it just works for you. It would be great if one could get all the benefits of Open Solaris without having to be a senior sysadmin in order to use them. When it comes to NAS appliances, Fishworks is just that. It's like a NetApp filer but better - more flexible, more scalable, better performance and cheaper. You can find a good overview of the new appliances here.

It is really interesting to know how it all started three years ago. About a year later Bryan asked me if I wanted to participate in the eval program for Fishworks - I didn't even know what it was back then but hey.. it's Bryan, it has to be something exciting! So I said yes right away :) :) :)

Now, in the environments I used to work in I'm not that big an enthusiast for appliances; nevertheless I can appreciate them in the environments they belong to - and I really believe that what they came up with will be a disruptive technology in a market where NetApp has almost a monopoly and is charging insane money for its proprietary technology. Don't get me wrong, NetApp does provide very good technology - it's just that it has always been overpriced and there was no real competition for them. Thanks to open source technologies like Open Solaris, ZFS, DTrace, etc. that's no longer the case. Actually, in the environments I used to work in, thanks to Open Solaris/ZFS, NetApp usually is no longer an option at all. Now thanks to Fishworks you have an alternative to NetApp, and I believe it is a much better alternative. And for everyone for whom it is important - Fishworks is built on top of open source software, with the file system being the most important building block. NetApp's file system is proprietary and closed, and you end up being locked in by them to some degree.

Almost two years ago I wrote one of my emails with observations about an early Fishworks prototype and I was so excited about Analytics that I went wild and proposed creating Doctor D! - a virtual advisor which would help tell you what's wrong with your system and what you should do to improve the situation. Of course we can't deliver something like this in the foreseeable future. Nevertheless Analytics gives you so much insight into a system that in essence you are becoming Doctor D! All you need to do is correlate all the forensic data presented to you in a very attractive manner. Of course it uses DTrace underneath, that's where the D! comes from, but because we are talking about an appliance you don't have to know DTrace or any other OS tools at all, and yet you can still harness the power of these tools and fix your issue.


You can get more information on the official Fishworks blog.
Also check the blogs below for more interesting info.

News stories I got from here:


Wall Street Journal: Sun Expands 'Open' Storage Line
http://online.wsj.com/article/SB122627611222612077.html?mod=googlenews_wsj

Forbes: Sun's Flash of Hope
http://www.forbes.com/cionetwork/2008/11/08/sun-storage-memory-tech-enter-cx_ag_1110sun.html

San Jose Mercury News: Sun hopes new line of storage servers will help revive its fortunes
http://www.mercurynews.com/breakingnews/ci_10929798?nclick_check=1

eWeek: Sun Discards RAID in Its First-ever Storage Appliance
http://www.eweek.com/c/a/Data-Storage/Sun-Unveils-Its-Firstever-Storage-Appliance/

IDG News Service: Sun Rolls out Its Own Storage Appliances
http://www.pcworld.com/businesscenter/article/153566/sun_rolls_out_its_own_storage_appliances.html

InformationWeek: Sun Unveils 'Open Storage' Appliances
http://www.informationweek.com/news/storage/systems/showArticle.jhtml?articleID=212001591

The Register: Sun trumpets radically simple open storage boxes
http://www.theregister.co.uk/2008/11/10/suns_amber_road_storage/

CRN: Sun Intros Solid State Drives, Analytics In Storage Line
http://www.crn.com/storage/212001321

CNET News: Sun expands its open storage line, hopes for accelerated growth
http://news.cnet.com/8301-13505_3-10092362-16.html

ZDNet: Sun claims to revolutionise storage with new array
http://news.zdnet.co.uk/hardware/0,1000000091,39547411,00.htm

TheStreet: Sun to Unveil New Open Storage Hardware
http://www.thestreet.com/story/10446870/1/sun-to-introduce-new-open-storage-hardware.html?puc=googlefi&cm_ven=GOOGLEFI&cm_cat=FREE&cm_ite=NA

Friday, October 31, 2008

Solaris 10 10/08

Solaris 10 10/08 aka Update 6 is finally out - you can download it from here.
Among the many new features and updates, a couple of my favorites:

  • LSI Mega SAS driver (Dell PERC 5/E, 5/I, 6E, 6I among others)
  • Root filesystem on ZFS
  • ZFS send enhancements (recursive send, cumulative incrementals, etc.)
  • ZFS failmode property - wait, continue or panic in case of catastrophic pool failure
  • zpool history -l pool - additionally provides user name, hostname and zone name
  • ZFS slogs
  • ZFS Hot-Plug support - hands-free disk auto-replacement
  • ZFS GZIP compression
There are many other enhancements; check the URLs below for more information on them.

What's New by update
New Features in U6
What's New

Thursday, October 30, 2008

Oracle, Listener, TCP/IP and Performance

Recently we migrated an Oracle 9 database to a new cluster to address performance issues and provide some spare capacity. The new servers have more CPU power, more memory and faster storage. After the migration everything seemed fine - the heaviest SQL queries were completing in a much shorter time, there were still plenty of spare CPU cycles available and, thanks to more memory, the database was doing much less I/O.

Nevertheless we started to get some complaints that from time to time performance was not that good. A quick investigation showed no CPU usage spikes, no I/O spikes, no network issues... but when I tried to connect to the database using telnet from a client, from time to time it hung for a couple of seconds, sometimes much longer, before it got into the connected state. Well, that suggested it was a network problem after all. I checked whether other network traffic was passing properly while my telnet was stalled - and it was. So it did not seem to be an issue with already established connections.

The most common case where network traffic is fine but you have trouble establishing new TCP connections is a saturated listen backlog queue. On Solaris there are two kernel tunables responsible for the listen backlog queue:

tcp_conn_req_max_q
Specifies the default maximum number of pending TCP connections
for a TCP listener waiting to be accepted by accept(3SOCKET).

tcp_conn_req_max_q0
Specifies the default maximum number of incomplete (three-way
handshake not yet finished) pending TCP connections for a TCP listener.


The problem is that if a server drops your packets because one of the above queues has overflowed, the client will retransmit several times, each time increasing the delay between retransmissions. So in our case it could take a long time to establish a connection, but once established the database was responding really fast.
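To get a feel for why a few dropped connection requests hurt so much, here is a minimal sketch of the extra wait a client sees under exponential backoff. The 3-second initial timeout and the doubling factor are illustrative assumptions only, not values measured on this system; real TCP stacks have their own tunables.

```python
def connect_delay(dropped_syns, initial_timeout=3.0, factor=2.0):
    """Total seconds a client waits before the SYN that finally gets through.

    Assumes (hypothetically) that each retransmission waits `factor` times
    longer than the previous one, starting at `initial_timeout` seconds.
    """
    delay = 0.0
    timeout = initial_timeout
    for _ in range(dropped_syns):
        delay += timeout      # wait out the current timeout, then retransmit
        timeout *= factor     # the next wait is longer
    return delay

for drops in range(4):
    print(drops, "dropped SYN(s) ->", connect_delay(drops), "s extra")
```

Under these assumed values a single drop already costs seconds, which is the same order of magnitude as the hangs seen with telnet.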

To check whether one of these queues had saturated and the system had started dropping packets I used 'netstat -sP tcp':

# netstat -sP tcp 1 | grep ListenDrop
tcpListenDrop =785829 tcpListenDropQ0 = 0
tcpListenDrop = 27 tcpListenDropQ0 = 0
tcpListenDrop = 19 tcpListenDropQ0 = 0
^C

So connections were being dropped because the ListenDrop queue was full, while the ListenDropQ0 queue had not saturated at all since the last reboot.
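If you want to keep an eye on these counters over time, the output is easy to post-process. Below is a small sketch that parses lines in the format shown above; the sample text is copied from that output, and treating the first line as the since-boot total and the later lines as per-interval values is an assumption about how the interval mode reports.

```python
import re

SAMPLE = """\
tcpListenDrop =785829 tcpListenDropQ0 = 0
tcpListenDrop = 27 tcpListenDropQ0 = 0
tcpListenDrop = 19 tcpListenDropQ0 = 0
"""

# Matches both counters, tolerating the irregular spacing around "=".
PATTERN = re.compile(r"(tcpListenDrop(?:Q0)?)\s*=\s*(\d+)")

def parse_listen_drops(text):
    """Return one {counter: value} dict per line of netstat output."""
    samples = []
    for line in text.splitlines():
        counters = {name: int(value) for name, value in PATTERN.findall(line)}
        if counters:
            samples.append(counters)
    return samples

samples = parse_listen_drops(SAMPLE)
since_boot, intervals = samples[0], samples[1:]
print("since boot:", since_boot)
print("drops in later samples:", sum(s["tcpListenDrop"] for s in intervals))
```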

The default limit for tcp_conn_req_max_q on Solaris 10 is 128. I increased it to 1024 by issuing 'ndd -set /dev/tcp tcp_conn_req_max_q 1024' and then restarting the Oracle listener. Unfortunately that did not fix the issue and the system was still dropping packets. I didn't think the queue limit was too small, as there were not that many drops occurring. But I had increased only the system limit - an application can request its own backlog value as long as it doesn't exceed the limit. So the Oracle listener was probably requesting a relatively low backlog in its listen(3SOCKET) call. I checked it with dtrace:

dtrace -n 'syscall::listen:entry{trace(execname);trace(pid);trace(arg0);trace(arg1);}'

and then restarted the listener. It turned out that the listener was requesting a backlog queue size of 5 when calling listen(). A quick Google search showed that there is an Oracle parameter called QUEUESIZE which you can put in the listener config file, and its default value is 5. I changed the parameter to 512 (the system limit had already been increased to 1024 a moment earlier):

listener.ora
[...]
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = XXXXX)(PORT = 1521)(QUEUESIZE = 512))
[...]

Then I restarted the listener and confirmed with dtrace that it requested a backlog size of 512. That was over two weeks ago; since then there hasn't been a single drop, and no one has reported any performance problems with the database.
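The interaction between the two settings can be summed up in a few lines. This is only a sketch of the idea, not of how the kernel or Oracle implement it: effective_backlog and open_listener are hypothetical helper names, and modelling the result as a simple minimum is an assumption based on the behavior described above.

```python
import socket

def effective_backlog(requested, system_limit):
    """Model of the queue size actually used: the request, capped by the limit."""
    return min(requested, system_limit)

def open_listener(backlog, host="127.0.0.1", port=0):
    """Open a TCP listener asking the kernel for the given backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))   # port 0 lets the OS pick a free port
    sock.listen(backlog)      # the value an application passes to listen()
    return sock

if __name__ == "__main__":
    # Before the fix: the listener asked for 5, so raising the limit changed nothing.
    print(effective_backlog(5, 128), effective_backlog(5, 1024))
    # After the fix: QUEUESIZE = 512 with the system limit at 1024.
    print(effective_backlog(512, 1024))
    listener = open_listener(512)
    listener.close()
```

This is why raising tcp_conn_req_max_q alone made no difference: the smaller of the two values wins, and the application was the one asking for 5.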

Problem fixed.

Tuesday, October 21, 2008

Intel SSD Drives + ZFS

It's marketing; nevertheless it highlights the benefits of ZFS's L2ARC and SLOGs.

update: presentation by Bill Moore on performance of SSD+ZFS

T6340 Blade

Sun has announced the new T6340 blade. I've been playing with and deploying UltraSparcT2+ systems for several months now and I can say it's a great platform for consolidation or migration from older HW. We've migrated from older v480 and v440 servers to a cluster of T5220 servers utilizing branded zones (Solaris 8 and Solaris 9 zones). The T6340 is basically a T5240 in blade form. All I can say is: it just works. The performance is excellent.

What a lot of people do not realize is that one can easily consolidate several e4500s or e6500s or any other old SPARC servers onto Niagara based servers. Not only will the performance be better, but you will also save on support and operational costs. Additionally you can use branded zones to migrate your Solaris 8 or Solaris 9 environment into a zone on Solaris 10 if you can't justify deploying your applications directly on Solaris 10. You can always create another, Solaris 10, zone later on and migrate your application between zones within the same hardware.

Friday, October 17, 2008

Open Solaris on IBM System Z

The Sirius Project Page has interesting Release Notes which explain Sun's motivation for porting Open Solaris to System Z and provide a lot of technical details on the OS implementation. You will also find there a very interesting history of the Solaris port to the x86 platform.

Rock at Hot Chips

stmsboot update

Are you enabling MPxIO manually because you don't like stmsboot? Are you disappointed when -L doesn't work? Have you ever wondered why the heck we need two reboots? Well, it seems these issues have been addressed and the fixes integrated into build 99.

Distribution Constructor

DC Project Page:
"The Distribution Constructor (DC) is a command line tool for building custom Open Solaris images. At this time, the tool takes a manifest file as input, and outputs an ISO image based on parameters specified in the manifest. Optionally, an USB image can be created based on the generated ISO image. Other output image formats are possible, but it is not produced by the DC team at this time."


Tuesday, October 14, 2008

xVM Server

If you like a black-box approach to virtualization then see this presentation.

Of course there is Open Solaris underneath but you don't need to know about it.
Now I'm trying to get early access to xVM Server :)

Time Slider

Monday, October 13, 2008

T5440

4x chips, 8-cores in each chip, 8 strands per each core = 256 virtual CPUs and 32x built-in HW crypto accelerators, up-to 512GB RAM, 8x PCI slots, 4x on-board GbE, 4x internal disk drives, all in 4U.

Allan Packer has posted more info on the beast.

If you consolidate applications which can easily take advantage of many threads you can get really good throughput for reasonable money - in some cases you will get better performance than 3x IBM p570 and it will cost you MUCH less.

An interesting application of the T5440 would be a 2-node cluster utilizing Solaris 8, Solaris 9 and native Solaris 10 zones - that allows quick and easy consolidation and migration of old environments. I did something similar in the past, but utilizing T5220 servers.

T5440 PCI-E I/O Performance
T5440 Architecture Overview
T5440 Photos

Tuesday, October 07, 2008

Niagara2+ ssh improvements

Recently ssh/openssl changes have been integrated so that, when running on Niagara-2 CPUs, ssh will automatically take advantage of the built-in HW crypto. So let's see what improvement we can expect.

Below is a chart of the total time it took to transfer about 500MB over ssh using different ciphers.


As you can see there is a very nice improvement and it's out of the box - no special configs, no tuning - it just works.


Details below.


Solaris 10 5/08

# time dd if=/dev/zero bs=1024k count=500 | ssh root@localhost 'cat >/dev/null'

real 0m52.635s
user 0m48.256s
sys 0m4.195s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c blowfish root@localhost 'cat >/dev/null'

real 0m39.705s
user 0m34.744s
sys 0m3.884s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c arcfour root@localhost 'cat >/dev/null'

real 0m34.551s
user 0m29.169s
sys 0m4.273s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-cbc root@localhost 'cat >/dev/null'

real 1m5.420s
user 0m54.914s
sys 0m3.963s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-ctr root@localhost 'cat >/dev/null'

real 0m52.227s
user 0m47.970s
sys 0m3.937s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c 3des root@localhost 'cat >/dev/null'

real 2m27.648s
user 2m23.886s
sys 0m4.071s


Open Solaris build 99

# time dd if=/dev/zero bs=1024k count=500 | ssh root@localhost 'cat >/dev/null'

real 0m20.478s
user 0m12.028s
sys 0m6.853s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c blowfish root@localhost 'cat >/dev/null'

real 0m43.196s
user 0m38.136s
sys 0m4.031s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c arcfour root@localhost 'cat >/dev/null'

real 0m20.443s
user 0m11.992s
sys 0m6.923s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-cbc root@localhost 'cat >/dev/null'

real 0m20.500s
user 0m12.008s
sys 0m7.054s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c aes128-ctr root@localhost 'cat >/dev/null'

real 0m21.372s
user 0m12.013s
sys 0m7.225s

# time dd if=/dev/zero bs=1024k count=500 | ssh -c 3des root@localhost 'cat >/dev/null'

real 0m21.758s
user 0m12.396s
sys 0m7.513s
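To compare the two sets of runs directly, the short script below converts the "real" times listed above into seconds and prints the speed-up per cipher. The numbers are copied from the output above; the only thing added is the arithmetic.

```python
import re

# "real" times copied from the runs above: (Solaris 10 5/08, Open Solaris build 99)
REAL_TIMES = {
    "default":    ("0m52.635s", "0m20.478s"),
    "blowfish":   ("0m39.705s", "0m43.196s"),
    "arcfour":    ("0m34.551s", "0m20.443s"),
    "aes128-cbc": ("1m5.420s",  "0m20.500s"),
    "aes128-ctr": ("0m52.227s", "0m21.372s"),
    "3des":       ("2m27.648s", "0m21.758s"),
}

def to_seconds(stamp):
    """Convert time(1) output such as '2m27.648s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)

for cipher, (old, new) in REAL_TIMES.items():
    speedup = to_seconds(old) / to_seconds(new)
    print(f"{cipher:<11} {to_seconds(old):7.1f}s -> {to_seconds(new):5.1f}s  ({speedup:.1f}x)")
```

Blowfish is the one cipher that does not improve in these runs; it is actually a few seconds slower on build 99.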

Tuesday, September 30, 2008

SSH on Niagara T2

Among the new features in build 99 of Open Solaris one is quite interesting - out of the box use of the built-in HW acceleration of UltraSparc T2 in ssh. It seems it can deliver over a 2x speed-up for scp. And yes, it's going to be back ported to Solaris 10. You can read more here.

Virtual Consoles in Solaris

One of the missing features of Solaris used to be Virtual Consoles - something we are so accustomed to in Linux. Solaris used to have virtual consoles once... a long, long time ago, then they disappeared. Finally, they are coming back into Open Solaris and have been integrated into build 100.

Rock support in Open Solaris

Now, how do I get access to beta HW with Rock on board?

Tuesday, September 23, 2008

Fast Reboot on x86

What if you have an x86 system with lots of memory and you want to reboot it? It will probably take minutes before your system starts booting. What if you could bypass the BIOS and POST testing and go directly to the bootloader? Such a nice feature has just been integrated into Open Solaris build 100. See PSARC 2008/382 for more details.

A new "-f" flag has been introduced to the "reboot" command
which allows a faster reboot that bypasses the BIOS and grub.

Thursday, September 18, 2008

MySQL Scalability

NetApp's PAM

NetApp introduced a new feature called PAM - Performance Acceleration Module.
"In the simplest terms, the Performance Acceleration Module is a second-layer cache: a cache used to hold blocks evicted from the WAFL® buffer cache. (WAFL is the NetApp® Write Anywhere File Layout, which defines how NetApp lays out data on disk. The WAFL buffer cache is a read cache maintained by WAFL in system memory.) In a system without PAM, any attempt to read data that is not in system memory results in a disk read. With PAM, the storage system first checks to see whether a requested read has been cached in one of its installed modules before issuing a disk read. Data ONTAP® maintains a set of cache tags in system memory and can determine whether or not a block resides in PAM without accessing the card. This reduces access latency because only one DMA operation is required on a cache hit."
It all sounds very good and is very, very similar to the L2ARC feature in ZFS, except that it seems NetApp only supports very expensive PCI/DRAM cards with relatively small capacity. With ZFS L2ARC not only can you get such functionality for free, but it is also more flexible and allows you to use any block device in a system as L2ARC - so you can provide hundreds of GBs of cache for a fraction of the cost of NetApp's cards. And ZFS is open sourced :)
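The general idea of a second-layer read cache is easy to show with a toy model. The sketch below is neither how ZFS nor how Data ONTAP implements it; the class name, the simple LRU policy and the sizes are all assumptions chosen to keep the example short.

```python
from collections import OrderedDict

class TwoLevelCache:
    """Toy read cache: blocks evicted from a small L1 are kept in a larger L2."""

    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()   # stands in for the in-memory cache
        self.l2 = OrderedDict()   # stands in for the larger, slower cache device
        self.l1_size, self.l2_size = l1_size, l2_size
        self.disk_reads = 0

    def read(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)          # refresh LRU position
            return "l1"
        source = "l2" if block in self.l2 else "disk"
        if source == "l2":
            del self.l2[block]                  # promote back into L1
        else:
            self.disk_reads += 1
        self._insert_l1(block)
        return source

    def _insert_l1(self, block):
        self.l1[block] = True
        if len(self.l1) > self.l1_size:
            victim, _ = self.l1.popitem(last=False)   # evict least recently used
            self.l2[victim] = True                    # keep it in the second layer
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)

cache = TwoLevelCache(l1_size=2, l2_size=4)
print([cache.read(b) for b in ["a", "b", "c", "a"]])
```

The last read of "a" is served from the second layer instead of disk, which is the whole point: the bigger that layer is, the more disk reads it can absorb.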

Intel SSD Drives

Some time ago Intel announced a new line of SSD drives. For the consumer market the X25-M looks very promising. You can read reviews here and here.

Thursday, September 04, 2008

VirtualBox 2.0

Sun has released a new version of VirtualBox. The main new features are:

  • 64 bits guest support (64 bits host only)
  • New native Leopard user interface on MacOS X hosts
  • The GUI was converted from Qt3 to Qt4 with many visual improvements
  • New-version notifier
  • Guest property information interface
  • Host Interface Networking on Mac OS X hosts
  • Host Interface Networking on Solaris 10 hosts
  • Support for Nested Paging on modern AMD-V CPUs (major performance gain)
  • Framework for collecting performance and resource usage data (metrics)
  • Clipboard integration for OS/2 Guests
  • Support for VHD images
  • Created separate SDK component featuring a new Python programming interface on Linux and Solaris hosts

Wednesday, September 03, 2008

5 years of DTrace

It's been five years since DTrace was integrated into Solaris.
It's one of the most important technologies in Operating Systems.

Read Bryan's blog entry on the last few hours before integration.

I remember one of my first big dtrace wins - a multi-threaded application, developed in-house by our developers, had some scalability problems on multi-CPU servers. They spent about a week looking for the issue without any progress. It was early days, just after dtrace had been made public, and I decided to give it a try. At first no real progress... while I was driving home the issue was still in the back of my mind. Suddenly I was 100% sure how to solve it with dtrace. First thing in the morning I sat down with the developers and we fixed the problem in 5 minutes. We were working on production code in a production environment. The era of DTrace had begun...

Sunday, August 24, 2008

OpenSolaris 2008.05

I've just re-installed Open Solaris on my laptop using the LiveCD. I was really (positively) surprised when NWAM detected my WiFi and asked me for the WEP key, so that during installation I could browse the Internet. However, what really surprised me was that after the install finished and I rebooted my laptop into Open Solaris it somehow remembered my password and got my WiFi working automatically without asking any questions - really cool.

Friday, August 22, 2008

End of Service for Solaris 7

From SDN:
"On August 15, 2008, Solaris 7 exited EOSL Phase 2. Except through custom support agreements, all support for Solaris 7 is terminated. On April 1, 2009, Solaris 8 enters EOSL Phase 2. Vintage patch entitlement (for patches developed ON OR after 4/1/09) requires purchase of the Solaris 8 Vintage Patch Service. Learn more."

Thursday, July 03, 2008

RAM->SSD->DISK + ZFS

Very interesting article written by Adam Leventhal.

I actually like the idea of L2ARC, especially on low-end systems. Imagine a 1U or 2U x86 box with one internal disk being a 143GB SSD used for L2ARC - basically you are getting a 143GB fast read cache for your MySQL database (or anything else). You probably can't even put that much memory in a 1U or 2U system in the first place (not to mention the cost).

Now imagine a much larger database. You buy an additional entry-level array like Sun's 2540 and you put in 12x143GB SSD drives for L2ARC (and at least one for SLOG if required). This gives you about 1.5TB of cache! You can cache a relatively large database here.
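As a quick sanity check of the cache size quoted above, here is the arithmetic spelled out. It uses raw decimal GB and ignores any formatting overhead; setting exactly one of the twelve drives aside for SLOG is an assumption about how the array would be split.

```python
DRIVE_GB = 143       # per-drive capacity from the text
TRAY_SLOTS = 12      # drives in the array

def l2arc_capacity_gb(slog_drives=0):
    """Raw cache capacity when `slog_drives` slots are set aside for SLOG."""
    return (TRAY_SLOTS - slog_drives) * DRIVE_GB

print(l2arc_capacity_gb(0))   # all 12 drives as L2ARC
print(l2arc_capacity_gb(1))   # one drive reserved for SLOG
```

With one drive reserved for SLOG the remaining eleven come to a little over 1.5TB of read cache.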

I also like the way L2ARC works - if you unmount a ZFS pool and mount it again the old L2ARC content will be re-used (thanks to ZFS checksums any stale data will be detected, skipped and read from disks instead). What it means is that if you connect your 2540 (or whatever) full of SSD drives to a cluster and you fail over your database along with your ZFS pool to another node, your 1.5TB of cache will still be warm. So the impact of a failover on your database performance can be greatly reduced.

Of course there are other scenarios and I'm keen to do some testing... :)

Wednesday, July 02, 2008

SystemTap (lack of) progress

I've just read a discussion on SystemTap. In a way it is a very sad read - a couple of years on, SystemTap is nothing more than a toy for some kernel developers, while DTrace has been stable for years now and is widely used by kernel developers, application developers, system administrators, database administrators, etc. There is also a growing ecosystem around DTrace - not only has it been ported to other OSes, but there is also DTrace support in Perl, Ruby, PHP, PostgreSQL, MySQL, Xorg, ...

The key point about DTrace is that it just works and does its job.

It's been in use for so long now that it is no longer that exciting to use it - it is rather utterly frustrating when there is no DTrace around...

Bryan posted an interesting comment about SystemTap.

Tuesday, July 01, 2008

NetApp vs. ZFS

Here is an update on the lawsuit. If you are interested, please also read the declaration by Dave Hitz. He basically confesses that NetApp is scared of ZFS and that it could put NetApp out of business. I agree - while I like NetApp, from an economic point of view in most cases it doesn't make sense to buy it - build your storage box yourself using ZFS. OK, you need certain skills to do so... but I'm 100% confident that sooner or later we will see appliances based on ZFS.

It's really sad that instead of innovating NetApp is going to court...

Thursday, June 12, 2008

ZFS + SSD

Two interesting posts about ZFS and SSD: 1 2

ps. look for ZFS L2ARC and ZFS SLOG

Tuesday, June 10, 2008

IP Instances and CE interfaces

If you need to get a ce interface working with IP instances in Zones on Solaris 10 - now you finally can. Go for patch 137086-01.

Snow Leopard

The server edition of Mac OS X code-named Snow Leopard has been announced, and it does include ZFS.

Sunday, May 18, 2008

Friday, May 16, 2008

uperf - benchmarking network

"Heard of filebench? Want something similar for networking? Look no further! Today we opensourced uperf, a tool to benchmark networking performance. uperf, just like its cousin filebench,1 is a framework that takes a description of a workload/application (called a profile), and generates load to match the profile. uperf is quite heavily used by the performance groups at Sun to study networking performance."
Read More.

Google Translate Adds New Languages

Google Translate adds 10 new languages, among them is Polish. Read more about it.

Wednesday, May 14, 2008

ZFS WriteThrottling

ZFS had a problem with properly throttling intensive writers like a simple dd if=/dev/zero of=/zfs/file, which would usually produce "jumpy" writes instead of a steady write stream. There is a new way of throttling in ZFS which should solve the problem - I have not tested it yet. The new code was integrated into build 87. Roch has posted a good explanation of the old and the new behavior.

Tuesday, April 22, 2008

Friday, April 18, 2008

Nevada New Features

A selection of new features from the last several builds, up to build 88:
  • New mega_sas driver which supports the LSI SAS1078 SAS RAID Storage Adapters and Dell PERC 5/E, 5/i, 6E, 6/i and CERC 6/i RAID Controllers
  • Project Brussels has been integrated - a standardized and easy way to configure network interfaces (no more custom ndd scripts, etc.)
  • 3D driver for ATI Radeon
  • xVM - aka XEN has been integrated
  • New native CIFS server
  • CIFS client integrated
  • Default router for a Zone
  • DTrace NFSv3 Provider
  • DTrace NFSv4 Provider
  • ZFS L2ARC
  • ZFS cachefile property (quicker imports for large SANs)
  • ZFS zpool failmode
  • Several network and wifi drivers have been integrated

Of course there is much much more.

Thursday, April 10, 2008

128 Virtual CPUs in 1U

Sun has revealed two new Niagara servers: T5140 and T5240. Both of them have two UltraSparcT2+ processors which gives you 16 physical cores in a system with 128 threads in total. T5240, which is 2U version, also allows you to put up-to 16 SAS disks and up-to 128GB of memory - wow!

Prices - recently Sun re-priced old Niagara systems and now you can buy T5240 with 2x Niagara for the same price you could buy old 1x CPU Niagara system just a week ago - cool.

T2+ has slightly faster memory controllers, but only two of them per CPU instead of the four in T2... I wonder how that will impact performance.

Couple of technical blogs on new servers: 1 2 3 4

Tuesday, April 01, 2008

Solaris Cluster 3.2 U1

Change log:

Core enhancements:
  • Support for Logical Domains
  • Support of EMC SRDF in a campus cluster configuration
  • Support for Sun StorageTek NAS storage
  • 8 nodes cluster configuration with x64 servers
  • Support for Veritas File System on x64 servers
  • Service Tags
  • GUI enhancements
Geographic Edition enhancements:
  • Support for Oracle RAC 10g/11g with EMC SRDF & HDS TrueCopy
  • Support for HDS TrueCopy on X64 servers
  • Support for EMC SRDF/S on X64 servers
  • Support for Solaris Containers
New supported applications and agents:
  • Oracle 11g RAC supported with Solaris Cluster on SPARC
  • New applications version supported
  • Support for Solaris Containers for Linux and Solaris 8 in HA Container Agent

Monday, March 31, 2008

ZFS De-Duplication

UPDATE: ZFS dedup finally integrated!

With the integration of this RFE we are closer (hopefully) to ZFS built-in de-duplication. Read more on Eric's blog.


Once ZFS re-writer and de-duplication are done in theory one should be able to do a zpool upgrade of current pool and de-dup all data which is already there... we will see :)

Eric mentioned on his blog that in reality we should use sha256 or stronger. I would go even further and offer two modes: one where you depend entirely on the block checksum, and another where you actually compare the blocks byte-by-byte to be 100% sure they are the same. Slower for some workloads, but safe.
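To make the two modes concrete, here is a toy block store that de-duplicates on a SHA-256 digest and can optionally verify candidates byte-by-byte. It is only an illustration of the idea; DedupStore and its methods are made-up names and have nothing to do with how ZFS will actually implement de-duplication.

```python
import hashlib

class DedupStore:
    """Toy block store: identical blocks are stored once, keyed by SHA-256."""

    def __init__(self, verify=False):
        self.verify = verify
        self.blocks = {}      # digest -> list of distinct blocks with that digest
        self.refs = []        # one (digest, index) reference per written block

    def write(self, data):
        digest = hashlib.sha256(data).digest()
        candidates = self.blocks.setdefault(digest, [])
        for index, stored in enumerate(candidates):
            # Mode 1 trusts the checksum; mode 2 also compares byte-by-byte.
            if not self.verify or stored == data:
                self.refs.append((digest, index))
                return False  # duplicate, nothing new stored
        candidates.append(data)
        self.refs.append((digest, len(candidates) - 1))
        return True           # new unique block stored

    def unique_blocks(self):
        return sum(len(c) for c in self.blocks.values())

store = DedupStore(verify=True)
for block in [b"A" * 4096, b"B" * 4096, b"A" * 4096]:
    store.write(block)
print(len(store.refs), "blocks written,", store.unique_blocks(), "stored")
```

The verify mode costs an extra read and compare of the existing block on every checksum match, which is where the "slower for some workloads" comes from.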

Now, de-dup which "understands" your data would be even better (analyzing file contents - like emails, attachments and de-dup on attachment level, etc.), nevertheless block level one would be a good start.

Thursday, March 20, 2008

Solaris 8 in a Zone

Well, what if you are stuck with Solaris 8 for many reasons but need to get it onto modern SPARC HW and, better yet, get it clustered? You also need MPxIO working with the latest arrays, and if you could go with SC3.2 for free to limit costs that would be ideal...

Playground: 2x T5220, 1x 2530 SAS array, Solaris 10 U4, MPxIO, IPMP, Sun Cluster 3.2 with Zones agent, patch 126020-02 applied (support for Etude). All of the software is free.

Then you install Etude - just two packages. Export your Solaris 8 root file system over nfs or create a flar archive. Now you create a Solaris Branded Zone with Solaris 8 emulation providing exported Solaris 8 or flar archive as a source and a moment later you have a working copy of your Solaris 8 system in a Solaris 10 zone - cool!

Now you configure that Zone under the cluster (a couple of commands) and you have it clustered, so you can switch that zone between systems.

So far so good. Next week come more functional tests and some basic application testing. If it goes well then we will switch production to it.

Last phase? Create another Zone - this time a Solaris 10 zone (a standard one), put it under the cluster and migrate applications between zones one by one, doing some cleaning at the same time.

In the meantime we will provide better reliability due to clustering, and better performance due to faster storage, more RAM and more CPU power.

Not only does it allow you to use recent HW and migrate rapidly, but since it's running on Solaris 10 you also benefit from technologies like ZFS (yes, an etude zone can be on zfs), DTrace, resource management, etc.

How hard is it to set-up? Actually very easy, way easier than you think.

ZFS Encryption

At yesterday's LOSUG Darren Moffat, Sun Senior Staff Engineer, presented the current status of ZFS encryption. It was a really interesting presentation. He even managed to panic the system :)

The good thing is it's going to be very easy to use and is going to be integrated relatively soon - IIRC about build 92. It was also nice to be able to talk to him after his presentation and share some thoughts.

If you are from the London area I think it would be worthwhile to pop in to a LOSUG meeting - you can always learn something new or meet new people.

Tuesday, March 18, 2008

S10 & ZFS - important patch

If you are using ZFS on Solaris 10 and experiencing some problems you should be interested in 127729-07 (x86) and 127728-06 (SPARC). Fixes introduced in the last revision:

Problem Description: 
6355623 zfs rename to valid dataset name, but if snapshot name becomes too long, panics system
6393769 client panic with mutex_enter: bad mutex, at get_lock_list
6513209 destroying pools under stress causes hang in arc_flush
6523336 panic dr->dt.dl.dr_override_state == DR_NOT_OVERRIDDEN, file: ../../common/fs/zfs/dbuf.c line: 2195
6533813 recursive snapshotting resulted in bad stack overflow
6535160 lock contention on zl_lock from zil_commit
6544140 assertion failed: err == 0 (0x11 == 0x0), file: ../../common/fs/zfs/zfs_znode.c, line: 555
6549634 dn_dbfs_mtx should be held when calling list_link_active() in dbuf_destroy()
6557767 assertion failed: error == 17 || lr->lr_length <= zp->z_blksz
6565044 small race condition between zfs_umount() and ZFS_ENTER()
6565574 zvol read perf problem
6569719 panic dangling dbufs (dn=ffffffff28814d30, dbuf=ffffffff20756008)
6573361 panic turnstile_block, unowned mutex
6577156 zfs_putapage discards pages too easily
6581978 assertion failed: koff <= filesz, file: ../../common/fs/zfs/zfs_vnops.c, line: 2834
6585265 need bonus resize interface
6586422 deadlock occurs when nfsv4 recover thread calls nfs4_start_fop
6587723 BAD TRAP: type=e (#pf Page fault) occurred in module "zfs" due to NULL pointer dereference
6589799 dangling dbuf after zinject
6594025 panic: dangling dbufs during shutdown
6596239 stop issuing IOs to vdev that is going to be removed
6617844 seems bug 4901380 has not been fixed in Solaris 10
6618868 ASSERT: di->dr_txg == tx->tx_txg (0x148 == 0x147), dbuf.c, line 1088
6620864 BAD TRAP panic in vn_invalid() called through znode_pageout_func()
6637030 kernel heap corruption detected during stress