Monday, April 20, 2009

Oracle Agrees to Acquire Sun Microsystems

This is a big surprise!

From Oracle's document on the acquisition:
• Protects and extends customers’ investment in Sun technologies
• Accelerate growth of Java as an open industry standard development platform
• Sustain Solaris as an industry standard OS for Oracle software
• Continue Open Storage and Systems focus and innovation
• Ensure continued innovation and investment in Java technology
• Optimize Solaris and Oracle for better performance, reliability, and manageability
• Protects massive customer investment in SPARC
• Open Storage built with industry standard servers and components

Sun's Official Announcement
Wall Street Journal

Monday, April 06, 2009

truss(1M) vs. dtrace(1M)

One of the many benefits of DTrace over truss is that dtrace should induce much less overhead when tracing applications, especially multi-threaded applications running on multi-core/multi-CPU servers. Let's put it to a quick test.

I quickly wrote a small C program which spawns N threads, each of which calls stat("/tmp") X times. Then I measured how long it takes to execute 1 million stat() calls in total with no tracing at all, under truss, and under dtrace.


One two-core AMD CPU
# ptime ./threads-2 1 1000000

real 2.662809885
user 0.223471401
sys 2.435895135

# ptime ./threads-2 2 500000

real 1.649542016
user 0.226104849
sys 3.045784378

# ptime truss -t xstat -c ./threads-2 2 500000

syscall seconds calls errors
xstat 6.966 1000000
stat64 .000 3 1
-------- ------ ----
sys totals: 6.966 1000003 1
usr time: .776
elapsed: 18.520

real 18.533000528
user 5.677239771
sys 16.069020190

# dtrace -n 'syscall::xstat:entry{@=count();}' -c 'ptime ./threads-2 2 500000'
dtrace: description 'syscall::xstat:entry' matched 1 probe

real 1.888294217
user 0.225676973
sys 3.506004575
dtrace: pid 8526 has exited

1000000

truss made the program run about 11x longer, while dtrace made it run only about 14% longer.


Niagara server:

# ptime ./threads-2 1 1000000

real 10.873
user 1.881
sys 8.992

# ptime ./threads-2 10 100000

real 1.467
user 1.962
sys 12.121

# ptime truss -t xstat -c ./threads-2 1 1000000

syscall seconds calls errors
stat 26.958 1000004 1
-------- ------ ----
sys totals: 26.958 1000004 1
usr time: 2.758
elapsed: 214.600

real 3:34.613
user 30.900
sys 2:28.182

# ptime truss -t xstat -c ./threads-2 10 100000

syscall seconds calls errors
stat 37.259 1000004 1
-------- ------ ----
sys totals: 37.259 1000004 1
usr time: 3.178
elapsed: 168.010

real 2:48.063
user 1:05.709
sys 3:35.813

# dtrace -n 'syscall::stat:entry{@=count();}' -c 'ptime ./threads-2 1 1000000'
dtrace: description 'syscall::stat:entry' matched 1 probe

real 14.028
user 1.957
sys 12.069
dtrace: pid 12920 has exited

1000939

# dtrace -n 'syscall::stat:entry{@=count();}' -c 'ptime ./threads-2 10 100000'
dtrace: description 'syscall::stat:entry' matched 1 probe

real 1.858
user 2.142
sys 15.632
dtrace: pid 11679 has exited

1000083

truss made the program run about 20x longer in the single-threaded case and about 115x longer in the multi-threaded one, while dtrace added no more than 30% to the execution time regardless of whether the application was running with one or many threads. This shows that one has to be especially careful when using truss on a multi-threaded application on a multi-CPU/core system. Notice that under truss the multi-threaded run is only slightly faster than the single-threaded one (168s vs. 215s elapsed), whereas with no tracing at all it is more than 7x faster (1.5s vs. 10.9s). This shows the ugly feature of truss: it serializes a multi-threaded application.

Of course the benchmark is a worst-case scenario and in real life you shouldn't get that much overhead from either tool. Still, in some cases truss could introduce too much overhead on a production server while dtrace would still be perfectly acceptable, allowing you to continue with your investigation.

btw: DTraceToolkit provides a script called dtruss - it is similar to truss but uses DTrace.
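The slowdown figures quoted in this post are simply ratios of the "real" times reported by ptime. Here is a minimal sketch in C for rechecking them; the helper names slowdown(), overhead_percent() and report() are mine and not part of any tool mentioned here.

```c
#include <stdio.h>

/* How many times longer the traced run took than the untraced baseline. */
double slowdown(double traced_secs, double baseline_secs)
{
    return traced_secs / baseline_secs;
}

/* Extra run time added by tracing, as a percentage of the baseline. */
double overhead_percent(double traced_secs, double baseline_secs)
{
    return (traced_secs / baseline_secs - 1.0) * 100.0;
}

/* Print one comparison line from a pair of "real" times in seconds. */
void report(const char *label, double traced_secs, double baseline_secs)
{
    printf("%-30s %7.1fx  (+%.0f%%)\n", label,
           slowdown(traced_secs, baseline_secs),
           overhead_percent(traced_secs, baseline_secs));
}
```

For example, report("Niagara, truss, 10 threads", 168.063, 1.467) uses the 2:48.063 and 1.467 second timings from the runs above and gives the roughly 115x figure.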



cat threads-2.c


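#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_THREADS 255

/* Each thread calls stat("/tmp") as many times as *arg says. */
void *thread_func(void *arg)
{
    int *N = arg;
    int i;
    struct stat buf;

    for (i = 0; i < *N; i++)
        stat("/tmp", &buf);

    return (NULL);
}

int main(int argc, char **argv)
{
    int N, iter;
    int i;
    int rc;
    pthread_t tid[MAX_THREADS];

    if (argc != 3)
    {
        printf("%s number_of_threads number_of_iterations_per_thread\n", argv[0]);
        exit(1);
    }

    N = atoi(argv[1]);
    iter = atoi(argv[2]);

    /* tid[] has a fixed size, so refuse thread counts that would overflow it */
    if (N < 1 || N > MAX_THREADS)
    {
        printf("number_of_threads must be between 1 and %d\n", MAX_THREADS);
        exit(1);
    }

    for (i = 0; i < N; i++)
    {
        rc = pthread_create(&tid[i], NULL, thread_func, &iter);
        if (rc != 0)
        {
            /* joining a thread that was never created is undefined, so stop here */
            printf("Thread #%d creation failed [%d]\n", i, rc);
            exit(1);
        }
    }

    /* wait for all threads to complete */
    for (i = 0; i < N; i++)
        pthread_join(tid[i], NULL);

    exit(0);
}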

Tuesday, March 31, 2009

ZFS Deduplication This Summer?

Jeff Bonwick wrote:
"Yes -- dedup is my (and Bill's) current project. Prototyped in December.
Integration this summer. I'll blog all the details when we integrate,
but it's what you'd expect of ZFS dedup -- synchronous, no limits, etc."

The CPU Overclocks itself

Joerg reports:
"With the announcement of Intel Nehalem support in Solaris, we pointed to some interesting features, but from my perspective the power-aware dispatcher is the most interesting one. I wrote a while ago about the turbo boost feature of the Nehalem processors. The processor overclocks itself, when there is still head room in the power and thermal budget. It can overclock a core even higher, when other cores are in deep sleep. Otherwise it can make sense not to use a core for a single process, when there is enough compute power available otherwise you could put this core into a deep sleep mode just to save power. The new power-aware dispatcher in Solaris is aware of this side conditions and can dispatch the processes in a System accordingly. You will find more informations at the projects website."

Thursday, March 26, 2009

Trying too hard

From time to time I see people trying to be too clever about some problems. What I mean is that sometimes they try too hard to use the latest technologies to do something while there is already a solution which does the job. Or sometimes, instead of taking a step back and a deep breath, they dive directly into problem solving and come up with crazy ways to accomplish something. I guess it happens to all of us from time to time. This time it happened to me :) :)

A colleague approached me with a problem he had on an old Solaris 7 server which is stripped down and customized, so there is no pargs command on it. He needed to get the full argument list of a running process, but ps truncates it to 80 characters. Well, I thought, a simple C program should be able to extract the information via /proc. So, trying to be helpful, I started to write it right away. After some time I came up with:


bash-2.05# cat pargs.c

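#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <procfs.h>
#include <sys/procfs.h>
#include <sys/prsystm.h>

#define MAX_ARGS 1024
#define MAX_ARG_LEN 256

int main(int argc, char *argv[])
{
    psinfo_t p;
    char *file;
    int fd;
    int fd_as;
    uintptr_t pargv[MAX_ARGS];
    char arg[MAX_ARG_LEN + 1];
    ssize_t n;
    int nargs;
    int i;

    if (argc != 3)
    {
        printf("Usage: %s /proc/PID/psinfo /proc/PID/as\n", argv[0]);
        exit(1);
    }

    file = argv[1];
    fd = open(file, O_RDONLY);
    if (fd == -1)
    {
        printf("Can't open %s file\n", file);
        exit(2);
    }

    if (read(fd, &p, sizeof(p)) != sizeof(p))
    {
        printf("Can't read %s file\n", file);
        exit(2);
    }
    close(fd);

    fd_as = open(argv[2], O_RDONLY);
    if (fd_as == -1)
    {
        printf("Can't open %s file\n", argv[2]);
        exit(2);
    }

    printf("nlwp: %d\n", p.pr_nlwp);
    printf("exec: %s\n", p.pr_fname);
    printf("args: %s\n", p.pr_psargs);
    printf("argc: %d\n", p.pr_argc);

    /* pr_argv is the address of the argv[] array inside the target process */
    nargs = p.pr_argc > MAX_ARGS ? MAX_ARGS : p.pr_argc;
    pread(fd_as, pargv, nargs * sizeof (uintptr_t), p.pr_argv);
    for (i = 0; i < nargs; i++)
    {
        /* read up to MAX_ARG_LEN bytes of each argument and terminate it */
        n = pread(fd_as, arg, MAX_ARG_LEN, pargv[i]);
        if (n <= 0)
            continue;
        arg[n] = '\0';
        printf(" %s\n", arg);
    }

    close(fd_as);
    exit(0);
}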



Job done.
Well, a couple of minutes later I realized that the UCB version of ps is able to show a long argument list...


bash-2.05# /usr/ucb/ps -axuww |grep "19179"
XXXX 19179 9.3 2.23998422056 ? S 11:02:30 0:02 /usr/java/bin/../bin/sparc/native_threads/java -classpath :./classes/packages/jakarta-regexp-1.3.jar:./classes/packages/classes12.zip:./classes/packages/mail.jar:./classes/packages/activation.jar:./classes/ MailSender
bash-2.05#


I had a good laugh at myself.

Tuesday, March 24, 2009

Library Interposer

Recently I used DTrace to change the output of the uname() syscall. But if one wants a more permanent and selective approach, it is easier to write a small library which interposes on uname() (well, actually on the uname() libc function and not on the syscall itself). I slightly modified the malloc_interposer example.

After you have compiled the library, all you have to do is LD_PRELOAD it in your script, so that everything started by that script will use it, or LD_PRELOAD it only for a given binary as shown below. Additionally, you have to set the variable uname_release to whatever string you like, otherwise the library won't do anything.

# uname -a
SunOS test-server 5.10 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V440
#
# uname_release="5.7" LD_PRELOAD=./uname_interposer.so uname -a
SunOS test-server 5.7 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V440



# cat uname_interposer.c
/* Based on http://developers.sun.com/solaris/articles/lib_interposers_code.html#malloc_interposer.c
 */
/* Example of a library interposer: interpose on
 * uname().
 * Build and use this interposer as follows:
 * cc -o uname_interposer.so -G -Kpic uname_interposer.c
 * setenv LD_PRELOAD $cwd/uname_interposer.so
 * run the app
 * unsetenv LD_PRELOAD
 */

#include <stdio.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

#include <sys/utsname.h>

int uname(struct utsname *name)
{
    int rc;
    char *release;
    static int (*uname_func)(struct utsname *) = NULL;

    /* look up the real uname() on first use */
    if (!uname_func)
        uname_func = (int (*)(struct utsname *)) dlsym(RTLD_NEXT, "uname");

    rc = uname_func(name);

    /* override the release field only when uname_release is set */
    if ((release = getenv("uname_release")) != NULL)
        strlcpy(name->release, release, _SYS_NMLN);

    return (rc);
}
#

# gcc -fPIC -g -o uname_interposer.so -G uname_interposer.c

Thursday, March 12, 2009

When Free is Too Expensive

I like Jonathan Schwartz's blog entries, and in his latest post he clarifies Sun's business model. I like the funny part about free software - how true it is.
"When Free is Too Expensive
One of my favorite customer stories relates to an American company that did nearly 30% of its yearly revenue on Christmas Day. They were a mobile phone company, whose handsets appeared under Christmas trees, opened en masse and provisioned on the internet within about a 48 hour period. When we won the bid to supply their datacenter, their CIO gave me the purchase order on the condition I gave him my home phone number. He said, "If I have any issues on Christmas, I want you on the phone making sure every resource available is solving the problem." I happily provided it (and then made sure I had my direct staff's home numbers). Christmas came and went, no problems at all.

A year later, he was issuing a purchase order to Sun for several of our software products. To have a little fun with him (and the Sun sales rep), I told him before he passed me the purchase order that the products were all open source, freely available for download.

He looked at me, then at his rep, and said "What? Then why am I paying you a million dollars?" I responded, "You can absolutely run it for free. You just can't call me on Christmas day, you'll be on your own." He gave me the PO. At the scale he was running, the cost of downtime dwarfed the cost of the license and support.

Numerically, most developers and technology users have more time than money. Most readers of this blog are happy to run unsupported software, and we are very happy to supply it. For a far smaller population, the price of downtime radically exceeds the price of a license or support - for some, the cost of downtime is measured in millions per minute. If you're tracking packages or fleets of aircraft, running an emergency response networking or a trading floor, you almost always have more money than time. And that's our business model, we offer utterly exceptional service, support and enterprise technologies to those that have more money than time. It's a good business."

Saturday, March 07, 2009

Open Storage - What's Next?

If you wonder what's coming in the storage area, and in ZFS in particular, watch the Open Solaris Storage Summit. To get your attention, here is a list of some really exciting features coming to ZFS:

  • DeDuplication in ZFS
  • User Quotas in ZFS
  • Disk Eviction/Pool Shrinking
  • VSS Shadow Copies with ZFS Snapshots
  • Persistent L2ARC
  • ZFS Encryption
  • Lustre + ZFS
  • pNFS + ZFS

Wednesday, March 04, 2009

Oracle 8.0.6 on Solaris 10

I'm working on migrating Oracle 8.0.6 32-bit from Solaris 7 to Solaris 10. There is no branded zone for Solaris 7, so we decided to try running Oracle 8.0.6 directly on Solaris 10. Basically it just works. Basically... the problem was that some of the database files are larger than 2GB and Oracle fails to recover the database on these files. After checking some log files and a little bit of dtrace'ing I found out that it does a stat() syscall on each db file before recovery starts, and stat() fails with EOVERFLOW. So it uses the wrong API... but it seems to work fine on Solaris 7 with the same binaries. It turned out that while Oracle is starting it calls uname() to determine the OS version, and based on that information it can change its behavior (like not using the proper API to access large files). The easiest workaround is to use dtrace to intercept the uname() syscall and put fake output in place just before it returns. After that everything seems to be working fine.

The dtrace script below puts the string "5.7" into the uname() structure for every application calling uname() with uid=300 (oracle in my case). One might also write a small interposing library and LD_PRELOAD it while starting Oracle - that should also work.

#!/usr/sbin/dtrace -qs

#pragma D option destructive

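/* remember the address of the caller's struct utsname */
syscall::uname:entry
/uid==300/
{
    self->addr = arg0;
}

/* release is the third 257-byte field of struct utsname, hence the 257*2 offset */
syscall::uname:return
/self->addr/
{
    copyoutstr("5.7", self->addr+(257*2), 257);
    self->addr = 0;
}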
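The 257*2 offset in the script comes from the layout of struct utsname: on Solaris it is a sequence of fixed-size character arrays of _SYS_NMLN (257) bytes each, and release is the third one, after sysname and nodename. Here is a small sketch in C that derives the same offset with offsetof instead of hard-coding it. The helper names are mine, and on other systems the field size (and therefore the offset) will differ.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Byte offset of the release field inside struct utsname. */
size_t release_offset(void)
{
    return offsetof(struct utsname, release);
}

/* Size in bytes of one utsname field (all fields before release share it). */
size_t utsname_field_size(void)
{
    return sizeof(((struct utsname *)0)->sysname);
}

/* Print the values a DTrace script like the one above would need. */
void print_release_offset(void)
{
    printf("field size: %zu, release offset: %zu\n",
           utsname_field_size(), release_offset());
}
```

On a system where the field size is 257, release_offset() is the same 514 that the script computes as 257*2.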

Tuesday, February 24, 2009

Monday, February 23, 2009

BART on large file systems

I wanted to use bart(1M) to quickly compare contents of two file systems. But it didn't work...

# bart create /some/filesystem/
! Version 1.0
! Monday, February 23, 2009 (11:53:57)
# Format:
#fname D size mode acl dirmtime uid gid
#fname P size mode acl mtime uid gid
#fname S size mode acl mtime uid gid
#fname F size mode acl mtime uid gid contents
#fname L size mode acl lnmtime uid gid dest
#fname B size mode acl mtime uid gid devnode
#fname C size mode acl mtime uid gid devnode
#

And it simply exits.

# truss bart create /some/filesystem/
[...]
statvfs("/some/filesystem", 0x08086E48) Err#79 EOVERFLOW
[...]
#

It should probably be statvfs64(), but let's check what's going on.

# cat statvfs.d
#!/usr/sbin/dtrace -Fs

syscall::statvfs:entry
/execname == "bart"/
{
self->on=1;
}

syscall::statvfs:return
/self->on/
{
self->on=0;
}

fbt:::entry
/self->on/
{
}

fbt:::return
/self->on/
{
trace(arg1);
}

# ./statvfs.d >/tmp/a &
# bart create /some/filesystem/
# cat /tmp/a
CPU FUNCTION
2 -> statvfs32
[...]
2 -> cstatvfs32
2 -> fsop_statfs
[...]
2 <- fsop_statfs 0
2 <- cstatvfs32 79
[...]
2 <- statvfs32 79
#

It should have used statvfs64() which should have used cstat64_32()

It seems that /usr/bin/bart is a 32-bit binary compiled without largefile(5) awareness. A bug for exactly this case has already been opened: 6436517.

While DTrace wasn't really necessary here, it helped me to see very quickly what is actually going on in the kernel and why it fails, especially combined with a glance at the source thanks to OS OpenGrok. It helped to find the bug.
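For comparison, here is a minimal sketch in C of what a large-file-aware caller looks like. Defining _FILE_OFFSET_BITS as 64 before any system header (see lfcompile(5)) makes a 32-bit build use the 64-bit variants of calls such as statvfs(), so a file system too large for the 32-bit fields no longer fails with EOVERFLOW. The function name fs_total_bytes is mine, and this is not how bart itself is written.

```c
#define _FILE_OFFSET_BITS 64   /* must come before any system header */

#include <errno.h>
#include <sys/statvfs.h>

/*
 * Store the total size in bytes of the file system containing path
 * in *total. Returns 0 on success, or the errno value set by
 * statvfs() on failure (EOVERFLOW would show up here).
 */
int fs_total_bytes(const char *path, unsigned long long *total)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return errno;

    /* f_blocks is counted in units of f_frsize */
    *total = (unsigned long long)vfs.f_frsize *
             (unsigned long long)vfs.f_blocks;
    return 0;
}
```

A caller would check the return value against 0 and print the error with strerror() otherwise. In a 64-bit build the define has no effect, because the fields are already wide enough.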

ps. the workaround in my case is to temporarily set a zfs/fs quota on /some/filesystem to 1TB - and then it works.