Thursday, December 17, 2009

My Presentation at LOSUG

Yesterday I gave a presentation at the London Open Solaris User Group on the backup platform I implemented. It utilizes open source technologies like Open Solaris, ZFS and rsync on commodity hardware to give us a better backup solution than NetBackup at a fraction of the cost. You can download the slides here. Before you do, it might be worth reading my two previous blog entries: 1 2, which provide some additional background.

Saturday, December 12, 2009

Read-Only Boot from ZFS Snapshot

PSARC 2009/670
Allow for booting from a ZFS snapshot.  The boot image
will be read-only. Early in boot a clone of the root
is created and used to provide writable storage for the
system image during its lifetime. Upon reboot, the
system image will reset to the same previous state.
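For the curious, I'd expect it to end up looking roughly like this on SPARC, assuming the case simply teaches the existing -Z boot option to accept a snapshot name (the final syntax may of course differ once it integrates):

ok boot -Z rpool/ROOT/opensolaris@pristine

Reboot without -Z and you are back on the normal, writable root dataset.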

Tuesday, December 01, 2009

VirtualBox 3.1 Released

This version is a major update. The following major new features were added:
  • Teleportation (aka live migration); migrate a live VM session from one host to another (see the manual for more information)
  • VM states can now be restored from arbitrary snapshots instead of only the last one, and new snapshots can be taken from other snapshots as well ("branched snapshots"; see the manual for more information)
  • 2D video acceleration for Windows guests; use the host video hardware for overlay stretching and color conversion (see the manual for more information)
  • More flexible storage attachments: CD/DVD drives can be attached to an arbitrary IDE controller, and there can be more than one such drive (see the manual for more information)
  • The network attachment type can be changed while a VM is running
  • Complete rewrite of experimental USB support for OpenSolaris hosts making use of the latest USB enhancements in Solaris Nevada 124 and higher
  • Significant performance improvements for PAE and AMD64 guests (VT-x and AMD-V only; normal (non-nested) paging)
  • Experimental support for EFI (Extensible Firmware Interface; see the manual for more information)
  • Support for paravirtualized network adapters (virtio-net; see the manual for more information)

Tuesday, November 24, 2009

Long ssh logins

On a couple of our servers running Solaris we noticed that it usually takes more than 10s to log in. Once in, everything is a snap. I quickly investigated and it turned out to be interesting. I used truss(1M) to see what's going on from the moment I connect to the moment I have a working shell.
# truss -f -o /tmp/a -v all -adDE -p 408
Then I logged in to the system and analyzed the /tmp/a file. First I confirmed that it took over 10s to log in. From the moment the connection was accepted to the moment I got an interactive session it took about 11s, as shown below:
[...]
408: 2.6594 0.0007 0.0000 fcntl(4, F_SETFL, (no flags)) = 0
[...]
12186: 14.0814 0.0001 0.0000 write(4, " | : { b7F S LB7A2 BA13".., 64) = 64
12196: read(0, 0x080473DF, 1) (sleeping...)
[...]
So I checked when it started to go wrong.
[...]
408: 2.6594 0.0007 0.0000 fcntl(4, F_SETFL, (no flags)) = 0
[...]
12196: 3.7245 0.0003 0.0003 forkx(0) = 12200
The connection started just before the fcntl() shown above and everything executes quickly up to the forkx() at 3.7245s. So far it has taken a little more than 1s. What happens next is a loop of hundreds of entries like:
[...]
12200: 4.5521 0.0000 0.0000 ioctl(3, ZFS_IOC_USERSPACE_ONE, 0x08046790) Err#48 ENOTSUP
12200: 4.5522 0.0001 0.0000 ioctl(7, MNTIOC_GETMNTENT, 0x08047C1C) = 0
12200: 4.5917 0.0395 0.0002 ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390) = 0
12200: 4.5918 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.5919 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.5919 0.0000 0.0000 door_info(6, 0x08046460) = 0
12200: target=189 proc=0x806FCD0 data=0xDEADBEED
12200: attributes=DOOR_UNREF|DOOR_NO_CANCEL
12200: uniquifier=289
12200: 4.5922 0.0003 0.0000 door_call(6, 0x080464D0) = 0
12200: data_ptr=FE430000 data_size=232
12200: desc_ptr=0x0 desc_num=0
12200: rbuf=0xFE430000 rsize=16384
12200: 4.5923 0.0001 0.0000 ioctl(3, ZFS_IOC_USERSPACE_ONE, 0x08046790) Err#48 ENOTSUP
12200: 4.5923 0.0000 0.0000 ioctl(7, MNTIOC_GETMNTENT, 0x08047C1C) = 0
12200: 4.6095 0.0172 0.0001 ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390) = 0
12200: 4.6096 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.6096 0.0000 0.0000 getuid() = 35148 [35148]
12200: 4.6097 0.0001 0.0000 door_info(6, 0x08046460) = 0
12200: target=189 proc=0x806FCD0 data=0xDEADBEED
12200: attributes=DOOR_UNREF|DOOR_NO_CANCEL
12200: uniquifier=289
12200: 4.6098 0.0001 0.0000 door_call(6, 0x080464D0) = 0
12200: data_ptr=FE430000 data_size=232
12200: desc_ptr=0x0 desc_num=0
12200: rbuf=0xFE430000 rsize=16384
12200: 4.6098 0.0000 0.0000 ioctl(3, ZFS_IOC_USERSPACE_ONE, 0x08046790) Err#48 ENOTSUP
12200: 4.6099 0.0001 0.0000 ioctl(7, MNTIOC_GETMNTENT, 0x08047C1C) = 0
12200: 4.6201 0.0102 0.0001 ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390) = 0
12200: 4.6202 0.0001 0.0000 getuid() = 35148 [35148]
12200: 4.6203 0.0001 0.0000 getuid() = 35148 [35148]
[...]
The process with PID 12200 was:
12200:   3.8229  0.0947  0.0013 execve("/usr/sbin/quota", 0x0811F9E8, 0x0811D008)  argc = 1
12200: *** SUID: ruid/euid/suid = 35148 / 0 / 0 ***
12200: argv: /usr/sbin/quota
Visually scanning a couple of pages of these ioctls, it looked like most of the total time was being spent in ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08046390). Let's check:
# grep "^12200:" /tmp/a |grep ioctl|grep ZFS_IOC_OBJSET_STATS|awk 'BEGIN{i=0}{i=i+$3}END{print i}'
9.7412
So out of the 11s, these ioctls alone took over 9s. To get a clearer picture, let's check how much time the quota command took:
# grep "^12200:" /tmp/a |head -1
12200: 3.7245 3.7245 0.0000 forkx() (returning as child ...) = 12196
# grep "^12200:" /tmp/a |tail -1
12200: 13.9854 0.0003 0.0000 _exit(0)
So it took about 10s, which means almost 100% of its time was spent doing the above ioctls.
There are almost 300 ZFS filesystems on this particular server, so it all adds up. Sometimes quota completes very quickly, sometimes it takes many seconds - I guess depending on whether the data requested from all these ZFS filesystems is cached or not. Note that you need to run quota as a non-root user; when run as root most checks are skipped and it is always quick.
Since we are not using quotas on these systems anyway, I commented out the quota check in /etc/profile (see the sketch below) and now a full login takes about 1s on average - a 10-12x improvement.
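For reference, the change is trivial - a rough sketch below, keeping in mind that the exact contents of /etc/profile differ between Solaris releases and local customizations:

# grep quota /etc/profile
#/usr/sbin/quota

The stock profile simply runs /usr/sbin/quota for every interactive login, so commenting that single line out removes the whole per-login cost.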

Friday, November 20, 2009

Xen 3.4

Xen 3.4 was integrated yesterday. It should appear in snv_129.

Wednesday, November 11, 2009

VirtualBox 3.1.0 Beta 1 released

VirtualBox 3.1.0 Beta 1 released.

I'm especially interested in "support for OpenSolaris Boomer architecture" which hopefully means a working microphone in a guest on my laptop - that would mean a working Skype on VB/Windows :)

Version 3.1 will be a major update. The following major new features were added:
  • Teleportation (aka live migration); migrate a live VM session from one machine to another
  • VM states can now be restored from arbitrary snapshots instead of only the last one, and new snapshots can be taken from other snapshots as well (aka branched snapshots)
  • 2D video acceleration for Windows guests; use the host video hardware for overlay stretching and colour conversion
  • The network attachment type can be changed while a VM is running
  • Experimental USB support for OpenSolaris hosts making use of the latest USB enhancements in Solaris Nevada 124 and higher.
  • Significant performance improvements for PAE and AMD64 guests (VT-x and AMD-V only; normal (non-nested) paging)
  • Experimental support for EFI (Extended Firmware Interface)
  • VirtIO network device support

In addition, the following items were fixed and/or added:
  • VMM: reduced IO-APIC overhead for 32 bits Windows NT/2000/XP/2003 guests; requires 64 bits support (VT-x only; bug #4392)
  • VMM: fixed double timer interrupt delivery on old Linux kernels using IO-APIC (caused guest time to run at double speed; bug #3135)
  • VMM: reinit VT-x and AMD-V after host suspend or hibernate; some BIOSes forget this (Windows hosts only; bug #5421)
  • GUI: prevent starting a VM with a single mouse click (bug #2676)
  • 3D support: major performance improvement in VBO processing
  • 3D support: added GL_EXT_framebuffer_object, GL_EXT_compiled_vertex_array support
  • 3D support: fix crashes in FarCry, SecondLife, Call of Duty, Unreal Tournament, Eve Online (bugs #2801, #2791)
  • 3D support: fix graphics corruption in World of Warcraft (#2816)
  • iSCSI: support iSCSI targets with more than 2TiB capacity
  • VRDP: fixed occasional VRDP server crash (bug #5424)
  • Network: fixed the E1000 emulation for QNX (and probably other) guests (bug #3206)
  • Network: even if the virtual network cable was disconnected, some guests were able to send / receive packets (E1000; bug #5366)
  • Network: even if the virtual network cable was disconnected, the PCNet card received some spurious packets which might confuse the guest (bug #4496)
  • VMDK: fixed handling of split image variants
  • VHD: fixed incompatibility with Hyper-V
  • OVF: create manifest files on export and verify the content of an optional manifest file on import
  • X11 based hosts: allow the user to specify their own scan code layout (bug #2302)
  • Mac OS X hosts: don't auto show the menu and dock in fullscreen (#bug 4866)
  • Solaris hosts: combined the kernel interface package into the VirtualBox main package
  • Solaris hosts: support for OpenSolaris Boomer architecture (with OSS audio backend).
  • Shared folders: fixed changing case of file names (bug #2520)
  • Shared folders: VBOXSVR is visible in Network folder (bug #4842)
  • Windows and Linux Additions: added balloon tip notifier if VirtualBox host version was updated and Additions are out of date
  • Solaris Additions: fixed as_pagelock() failed errors affecting guest properties (bug #5337)
  • Windows Additions: added automatic logon support for Windows Vista and Windows 7
  • Windows Additions: fix crash in seamless mode (contributed by Huihong Luo)
  • Linux Additions: added support for uninstalling the Linux Guest Additions (bug #4039)
  • SDK: added object-oriented web service bindings for PHP5

Tuesday, November 10, 2009

ZFS send dedup

PSARC/2009/557 ZFS send dedup was integrated yesterday. It allows deduplication of a zfs send|recv stream regardless of whether dedup is enabled on the sending and/or receiving side. It looks like both pool and zfs send stream deduplication will be available in snv_128. Of course you can pull the sources and build them yourself if you want to start playing with dedup now; otherwise you will have to wait until the beginning of December, when snv_128 should hit the /dev Open Solaris repository.
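A quick sketch of how I expect it to be used, assuming the new functionality is exposed as a -D flag to zfs send (check the man page once snv_128 is out for the final syntax):

# zfs send -D tank/fs@monday | ssh backuphost zfs recv backup/fs

The stream itself gets deduplicated on the wire, independently of whether the dedup property is set on either pool.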

Thursday, November 05, 2009

No need for fsck in ZFS

Recently there was an article at OSNews, "Should ZFS Have a fsck Tool?". Well, no, it shouldn't. I wanted to write an explanation of why that is the case, but Joerg was first and there is no point in repeating him. So if you wonder why ZFS doesn't need a fsck tool, read Joerg's blog entry about it.

Monday, November 02, 2009

ZFS Deduplication Integrated!

It took longer than expected but it has finally been integrated! Read Jeff Bonwick's post on ZFS dedup.
PSARC 2009/571 ZFS Deduplication Properties
6677093 zfs should have dedup capability
You can find code changes here.
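If you pull the bits and build them yourself, playing with it should be as simple as setting the new dataset property and watching the pool-wide ratio - a sketch based on the PSARC case, with made-up pool/dataset names:

# zfs set dedup=on tank/data
# zpool get dedupratio tank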

Sunday, October 25, 2009

Apple Abandons ZFS

According to Jeff Bonwick (one of the main developers of ZFS) the real reason is:

> Apple can currently just take the ZFS CDDL code and incorporate it
> (like they did with DTrace), but it may be that they wanted a "private
> license" from Sun (with appropriate technical support and
> indemnification), and the two entities couldn't come to mutually
> agreeable terms.

> I cannot disclose details, but that is the essence of it.
>
> Jeff

Friday, October 23, 2009

Solaris 10 on Open Solaris

Yesterday a new zone brand was integrated into Open Solaris: Solaris 10 Zones. It allows running Solaris 10 in a zone on Open Solaris. The technology will be very useful for enterprise customers once Solaris 11 is out, or when deploying Open Solaris. It is especially handy if you want to make use of the latest hardware platforms and/or technologies delivered in Open Solaris (and Solaris 11 in the future) but need to migrate older environments running Solaris 8, 9 or 10 without introducing changes to them (to mitigate risk, or for lack of resources to do a more detailed validation of a new environment). This matters all the more with Open Solaris and Solaris 11, as they are so different from older Solaris releases. Branded zones are a very neat solution for providing backward compatibility without putting too much of a hurdle in front of innovation, and without making the majority of users who do not need it pay the price for it.

See Jerry's blog entry about Solaris 10 Zones.

Oracle, MySQL, EU and money

The whole MySQL issue with Oracle and the EU is just silly. I'm not going to repeat the arguments here, as there are plenty of comments you can find and I believe everything that could be said about it has already been said. If you are interested in what is going on and why, read these two articles which summarize the whole issue very well: here is one from Groklaw, and this one is also very interesting.

It is really sad to see to what extent some people are greedy...
I think that RMS and Monty lost whatever credibility they had left. They definitely lost it completely in my eyes. Not that Monty cares, as he is after more money here... He wants to cash in twice... Call me an idealist or whatever, but I honestly believe that life and business are not *only* about money.

Wednesday, October 21, 2009

zfs set dedup=on

There are two PSARC cases regarding ZFS deduplication which are expected to be approved today.

PSARC 2009/571 ZFS Deduplication Properties
PSARC 2009/557 ZFS send dedup

Hopefully it means that ZFS deduplication will be finally integrated soon.

Wednesday, October 14, 2009

IBM's Black Magic

Somehow I'm not surprised that the comments section is disabled in this example of IBM's Black Magic. I won't comment on it myself as there is no point in repeating Joerg's arguments - I agree with him 100%. This is about the recent world record in the TPC-C benchmark from Oracle/Sun.

However I would like to suggest to Ms. Stahl that there are other big advantages in IBM's setup that she missed - not only did IBM do more tpmC/core, they also achieved:
  • a larger number of racks per tpmC
  • a larger $/tpmC
  • larger power usage per tpmC
I think that most customers would somehow consider the above three more important than tpmC/core (assuming they would consider TPC-C relevant in the first place).

Friday, October 09, 2009

Win $10 Million



I already like the Oracle marketing :)

I only hope that while keeping it aggressive and provocative they won't end up like IBM with their black-magic-style benchmarks and comparisons... If anything of Sun's culture survives the acquisition, that is rather unlikely to happen.

Thursday, October 08, 2009

300k SPC-1 IOPS from IBM

This is quite an impressive result. On the other hand the price per SPC-1 IOPS is rather high, and one wonders why they took a server with 64 cores and used only 48 of them. Is there a technical reason? Does anyone know?

Thursday, September 17, 2009

Improved stat() performance on files on zfs

Bug ID: 6775100 stat() performance on files on zfs should be improved was fixed in snv_119.
I wanted to do a quick comparison between snv_117 and snv_122 on my workstation to see what kind of improvement there is. I wrote a small C program which calls stat() N times in a loop. This is of course a micro-benchmark; additionally, it doesn't cover whether stat() on entries not cached in the DNLC has improved too.

So I ran the program several times on each build after a fresh reboot. These were the numbers I was getting on average:

snv_117/ZFS# ptime ./stat_loop test 1000000

real 1.941163151
user 0.219955617
sys 1.707997800

snv_122/ZFS# ptime ./stat_loop test 1000000

real 1.089193770
user 0.199055005
sys 0.889680683

snv_122/UFS# ptime ./stat_loop test 1000000

real 0.905696133
user 0.187513753
sys 0.716955921
This is an over 40% improvement in stat() performance on ZFS - nice.
Still, stat() on UFS is faster by about 17%.

The fix could also help some very busy NFS servers :)
AFAIK it has not been backported to Solaris 10 so if you think you need it either go for Open Solaris or open a case with Sun's support and ask for a fix for S10.

Friday, August 28, 2009

Oracle: Sun vs. IBM



"Oracle and Sun together are hard to match. Just ask IBM. Its fastest server now runs an impressive 6 million TPC-C transactions, but on October 14 at Oracle OpenWorld, we'll reveal the benchmark numbers that prove that even IBM DB2 running on IBM's fastest hardware can't match the speed and performance of Oracle Database on Sun systems. Check back on October 14 as we demonstrate Oracle's commitment to Sun hardware and Sun SPARC."

Tuesday, August 11, 2009

SXCE To Be EOL'ed

We all knew it would happen sooner or later. For some people it is probably too soon; for others like me it doesn't really matter, as they have been deploying the Open Solaris distribution instead of SXCE for quite some time. While it might seem like a slightly premature decision, I believe it will allow available resources to be better utilized, so the Open Solaris community can focus more on what's in front of us rather than putting their time into SXCE, which has no future...

The official announcement:
Sun is announcing the intent to discontinue production of the Solaris Express Community Edition (SXCE) by the end of October time-frame. As we intend to continue on a bi-weekly build schedule, consolidations will move towards producing native Image Packaging System (IPS) packages alongside SVR4 packages and then phase out the latter completely. Technologies such as IPS, Automated Install, Snap Upgrade and the Distribution Constructor will be integrating into a consolidation after following through the established processes including architectural (ARC) review.

We recognize that this transition will require some effort for all members of the OpenSolaris development community, and are committed to working with all of you in making that transition a success. You can expect updated information from us and the communities which manage the consolidations as we further plan the transition schedules.

Questions can be directed to David Comay, Glynn Foster, William Franklin, Stephen Hahn, Dave Miner, Vincent Murphy, or Dan Roberts.

Monday, August 10, 2009

Read or Write Only Process

PSARC/2009/378:
This project proposes two new "basic" privileges.

FILE_READ
Allows a process to read a file or directory whose
permission or ACL allow the process read permission.

FILE_WRITE
Allows a process to write a file or directory whose
permission or ACL allow the process write permission.

The purpose of these privileges is the ability to create a "read-only" (no FILE_WRITE privilege) and a "write-only" (no FILE_READ privilege) process.

The FILE_WRITE basic privilege is required for any modification to a file or directory: open(2), creat(2), link(2), symlink(2), rename(2), unlink(2), mkdir(2), rmdir(2), mknod(2) etc.

The FILE_READ basic privilege is required for opening a file with O_RDONLY or O_RDWR.

Note: a "basic" privilege is a privilege which is part of the default I, P and E privilege set.

ZFS: logbias

PSARC/2009/423:
Summary

Provide zfs with the ability to control the use of resources used for synchronous (eg fsync, O_DSYNC) requests. In particular it enables substantially better performance for Oracle and potentially other applications.

Background

Oracle manages two major types of files, the Data Files and the Redo Log files. Writes to Redo Log files are in the path of all transactions and low latency is a requirement. It is critical for good performance to give high priority to these write requests.

Data Files are also the subject of writes from DB writers as a form of scrubbing dirty blocks to insure DB buffer availability. Write latency is much less an issue. Of more importance is achieving an acceptable level of throughput. These writes are less critical to delivered performance.

Both types of writes are synchronous (using O_DSYNC), and thus treated equally. They compete for the same resources: separate intent log, memory, and normal pool IO. The Data File writes impede the potential performance of the critical Redo Log writers.

Proposal

Create a new "logbias" property for zfs datasets.

If logbias is set to 'latency' (the default) then there is no change from the current implementation. If the logbias property is set to 'throughput' then intent log blocks will be allocated from the main pool instead of any separate intent log devices (if present). Also data will be written immediately to spread the write load thus making for quicker subsequent transaction group commits to the pool.

To change the property, an admin will use the standard zfs set command:

# zfs set logbias=latency {dataset}
# zfs set logbias=throughput {dataset}

Monday, July 27, 2009

Radiation-Hardened SPARC

Atmel Introduces the AT697F Radiation-Hardened SPARC V8 Processor for Space Missions.

Joerg explains why SPARC:
"The reason is not Solaris. It's a different one. Despite general opinion, SPARC isn't a proprietary architecture. You can go to SPARC International, get the specification for a 100 bucks or so and you can develop and manufacture your own SPARC CPU. This was used by the European Space Agency to develop a radiation hardened version of a SPARC architecture. The development is called LEON, the proc named in the article uses the LEON2-FT design. The LEON2 and LEON3 design are available under the GNU General Public License respectively under the GNU Lesser General Public License. Being able to get such an architecture essentially for cheap money was the essential reason behind the decision for SPARC (besides of other technical reasons)"

Monday, July 20, 2009

RAID-Z3

RAID-Z3 (triple-parity RAID) has been implemented and integrated into snv_120 (bug id 6854612) - thanks to Adam Leventhal. Read more on Adam's blog.

Wednesday, July 15, 2009

Windows 7

Alright Ben, you've convinced me to try Windows 7, especially as you can download it and try it for free.

Friday, July 03, 2009

MySQL - 8x Performance Improvement

I came across an interesting problem today: a Perl script running a MySQL query takes too much time to complete. It spends almost all of its time waiting for this MySQL query (anonymized):

select a, b, registered, c, d from XXX where date(registered) >= date_sub(date(NOW()),interval 7 day)

The problem is that there are over 70 million rows in the XXX table and the query takes over 7 minutes to complete, mostly waiting for disk I/O.

explain select a, b, registered, c, d from XXX where date(registered) >= date_sub(date(NOW()),interval 7 day)\G

id: 1
select_type: SIMPLE
table: XXX
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 72077742
Extra: Using where
1 row in set (0.02 sec)


So the reason it is so slow is that MySQL does not use an index here. It turns out that if you apply a function to a column in a WHERE clause, MySQL won't use an index on that column!

There is a reason why the statement uses the date() and date_sub() functions: it is expected to compare dates where the time is 00:00:00, so I can't simply drop them. But one can cast() the computed date to a datetime and use the registered column directly, which allows MySQL to use the index:
explain select a, b, registered, c, d from XXX where registered >= cast(date_sub(date(NOW()),interval 7 day) as datetime)

id: 1
select_type: SIMPLE
table: XXX
type: range
possible_keys: YYY
key: YYY
key_len: 9
ref: NULL
rows: 1413504
Extra: Using where
1 row in set (0.10 sec)
After the modification the script takes about 50s to execute compared to over 7 minutes, which is a very nice 8x performance improvement! Not to mention much less impact on the database server.

Monday, June 22, 2009

OpenSolaris Apps of Steel Challenge

Recently I participated in the OpenSolaris Apps of Steel Challenge and I won a laptop!
My Toshiba Portégé® R600 arrived this morning and it comes pre-installed with Open Solaris. First impressions are really good - it is so light.

It is a fortunate coincidence that I'm on a short holiday right now as I will have more time to play with it :) Already doing an upgrade.

Thank you Sun you've made my day!

Friday, June 19, 2009

Intercepting Process Core Dumps

We disabled process core dumps in one of our environments, but we still want to know when a process tries to dump core, along with some more information on the event.

root@ dtrace -q -n fbt:genunix:core:entry \
'{printf("%Y exec: %s args: %s cwd: %s pid: %d zone: %s signal: %d\n", \
walltimestamp, curpsinfo->pr_fname, curpsinfo->pr_psargs, cwd, pid, \
zonename, arg0);}' >/local/tmp/process_cores.log

Now let's kill a process so that it tries to dump core:

root@ bash -x
root@ kill -SIGBUS $$
+ kill -SIGBUS 14054
Bus Error (core dumped)
root@

root@ tail -1 /local/tmp/process_cores.log
2009 Jun 19 16:07:54 exec: bash args: bash -x cwd: /home/milek pid: 14054 zone: global signal: 10
root@
The overhead of running the script is practically nil unless you're trying to dump as many cores per second as possible, and even then the overhead should be relatively small :)

Thursday, June 18, 2009

Cognitive Computing via Synaptronics and Supercomputing (C2S2)

Cognitive Computing via Synaptronics and Supercomputing (C2S2):
"By seeking inspiration from the structure, dynamics, function, and behavior of the brain, the IBM-led cognitive computing research team aims to break the conventional programmable machine paradigm. Ultimately, the team hopes to rival the brain’s low power consumption and small size by using nanoscale devices for synapses and neurons. This technology stands to bring about entirely new computing architectures and programming paradigms. The end goal: ubiquitously deployed computers imbued with a new intelligence that can integrate information from a variety of sensors and sources, deal with ambiguity, respond in a context-dependent way, learn over time and carry out pattern recognition to solve difficult problems based on perception, action and cognition in complex, real-world environments."

Solaris Technologies Introduced for Oracle Database

Happy Birthday

Yesterday's LOSUG was a little bit surreal as we were singing happy birthday to OpenSolaris and had a birthday cake and champagne.

Wednesday, June 17, 2009

VirtualBox 3.0.0 Beta1

This version is a major update. The following major new features were added:
  • Guest SMP with up to 32 virtual CPUs (VT-x and AMD-V only)
  • Windows guests: ability to use Direct3D 8/9 applications / games (experimental)
  • Support for OpenGL 2.0 for Windows, Linux and Solaris guests

In addition, the following items were fixed and/or added:
  • Virtual mouse device: eliminated micro-movements of the virtual mouse which were confusing some applications (bug #3782)
  • Solaris hosts: allow suspend/resume on the host when a VM is running (bug #3826)
  • Solaris hosts: tighten the restriction for contiguous physical memory under certain conditions
  • VMM: fixed occasional guru meditation when loading a saved state (VT-x only)
  • VMM: eliminated IO-APIC overhead with 32 bits guests (VT-x only, some Intel CPUs don’t support this feature (most do); bug #638)
  • VMM: fixed 64 bits CentOS guest hangs during early boot (AMD-V only; bug #3927)
  • VMM: performance improvements for certain PAE guests (e.g. Linux 2.6.29+ kernels)
  • GUI: added mini toolbar for fullscreen and seamless mode (Thanks to Huihong Luo)
  • GUI: redesigned settings dialogs
  • GUI: allow to create/remove one host-only network adapters
  • GUI: display estimated time for long running operations (e.g. OVF import/ export)
  • GUI: Fixed rare hangs when open the OVF import/export wizards (bug #4157)
  • VRDP: support Windows 7 RDP client
  • Networking: fixed another problem with TX checksum offloading with Linux kernels up to version 2.6.18
  • VHD: properly write empty sectors when cloning of VHD images (bug #4080)
  • VHD: fixed crash when discarding snapshots of a VHD image
  • VBoxManage: fixed incorrect partition table processing when creating VMDK files giving raw partition access (bug #3510)
  • OVF: several OVF 1.0 compatibility fixes
  • Shared Folders: sometimes a file was created using the wrong permissions (2.2.0 regression; bug #3785)
  • Shared Folders: allow to change file attributes from Linux guests and use the correct file mode when creating files
  • Shared Folders: fixed incorrect file timestamps, when using Windows guest on a Linux host (bug #3404)
  • Linux guests: new daemon vboxadd-service to handle time synchronization and guest property lookup
  • Linux guests: implemented guest properties (OS info, logged in users, basic network information)
  • Windows host installer: VirtualBox Python API can now be installed automatically (requires Python and Win32 Extensions installed)
  • USB: Support for high-speed isochronous endpoints has been added. In addition, read-ahead buffering is performed for input endpoints (currently Linux hosts only). This should allow additional devices to work, notably webcams.
  • NAT: allow to configure socket and internal parameters
  • Registration dialog uses Sun Online accounts now.

Tuesday, June 16, 2009

Turbo-Charging SVr4 Package Install

Have you ever been frustrated by slow patching or package installation on Solaris 10? Looks like the issue has been partially addressed by PSARC 2009/173. Hopefully it will be integrated into Solaris 10 soon.

Wednesday, June 10, 2009

GCC vs. Sun Studio

It all depends on your code, the options used, etc., of course.
In the past I did some comparisons of my own and the difference usually wasn't that big, and Studio didn't always produce better results - but that was a couple of years ago.

Wednesday, June 03, 2009

ld.so.1: picld: fatal: libpsvcobj.so.1: open failed: No such file or directory

If you notice that 'prtdiag -v' doesn't print all the information it should and you get the error below in the system log file:
picld[165]: [ID 740666 daemon.crit] ld.so.1: picld: fatal: libpsvcobj.so.1: open failed: No such file or directory
it means you probably hit bug: 6780957
There is a workaround proposed in the bug, however if you are running ZFS as the root filesystem then:
"Due to the random nature of how ZFS stores files in a directory, the workaround may or may not work."
I was unlucky this time on one server and lucky on another one. Despite trying to populate the plugins directory several times in different ways I was still unlucky. There are many ways to work around the issue; for example, the one below works fine for me:
root# svccfg -s picl setenv LD_LIBRARY_PATH "/usr/platform/sun4u/lib"
root# svcadm refresh picl
root# svcadm restart picl

Friday, May 29, 2009

YAIMA

IBM has posted yet another marketing article. I don't know if it is funny or irritating - perhaps both. It is the usual pseudo-technical article from IBM - International Bureau for Misinformation? :) Don't get me wrong - I like IBM and I often admire what they are doing, and not only in the server market-space but for science in general. It's just that their server division seems to be made up of only marketing people and nothing more. And not entirely honest ones...

So let's go through some of the revelations in the article.

"HP-UX, HP's flavor of UNIX, is now up to release 11iV3. HP-UX is based on System V and runs on both HP9000 RISC servers and HP Integrity Itanium systems. In this respect, it is similar to Solaris, which can run on their SPARC RISC architecture, as well as x86 machines. AIX can only run on the POWER® architecture; however, given how UNIX is a high-end operating system, it is a positive thing that AIX and the POWER architecture are tightly integrated."

How ridiculous this is! Do they want to imply that HP-UX or Solaris are not tightly integrated with their respective RISC platforms? Of course they are. The fact is that AIX and HP-UX do not run on the most commonly used platform these days: x86/x64. And this is one of the reasons why they are dying platforms. Sure, they will stay in the market for a long time, mostly because a lot of enterprise customers won't/can't migrate off them quickly. But if you are building a new environment, in almost all cases there is no point in deploying AIX or HP-UX - no point for the customer, that is.

Later in the document there is a section called "Solaris innovations", however they do not actually list Solaris innovations, only a couple of selected feature updates from the 10/08 release (and the 05/09 release has already been out for some time). From a marketing point of view it is very clever: if you are not reading the article carefully you would probably be under the impression that there aren't many innovations in Solaris... What about DTrace? SMF? FMA? Branded Zones? Resource Management? Recent improvements for the Intel platform (intelligent power management, MPO, etc.), etc.?

Then they end the section with another astonishing claim:
"These recent improvements to ZFS are very important. When ZFS first came out, it looked incredible, but the root issue was a glaring omission in feature functionality. With this ability now added, Solaris compares favorably in many ways to JFS2 from AIX and VxFs from HP."
What are they smoking? There is no sense even trying to argue with it; IBM can only wish it were true. You won't find most of the ZFS features everyone cares so much about in IBM's JFS2. ZFS is years ahead of JFS2 - it's a completely different kind of technology - and I doubt IBM will ever catch up with JFS2; it would probably be easier to write an fs/lvm from scratch or port ZFS...

Later on they move to "AIX innovations" and they start with:
"AIX 6.1, first released about two years ago, is now available in two editions: standard, which includes only the base AIX, and the Enterprise edition, which includes workload partition manager and several Tivoli® products."
I hope that is not supposed to be an innovation... well, I prefer the Linux or Solaris model where you get ALL the features in the standard OS and entirely for free. And it doesn't matter if it is a low-end x86 server, your laptop, or a large SPARC server with 100+ cores and terabytes of memory... you still use the same Solaris with all the features, for free if you do not need support.

One of the listed "innovations" in AIX is WPARs... well, they provided that functionality many years after Solaris had Zones (which in turn were inspired by BSD's jails). It will take at least a couple of years for WPARs to mature... then they claim that "No other UNIX can boast the ability to move over running workloads on a workload partition from one system to another without shutting down the partition". Well, it is not true. You can live migrate LDOMs on Solaris, you can live migrate xVM guests on Solaris and you can live migrate Xen guests on Linux. Well, IBM is trying to catch up here again...

Then there are a lot of false (or incomplete) claims comparing virtualization technologies in other OSes to AIX.

They claim AIX can do:
"Micro-partitioning: This feature allows you to slice up a POWER CPU on as many as 10 logical partitions, each with 1/10th of a CPU. It also allows for the capability of your system to exceed the amount of entitled capacity that the partition has been granted. It does this by allowing for uncapped partitions."
Well, it is all great but... I don't know about HP-UX, but on Solaris you can slice your CPUs (be it SPARC or x86, or IBM's mainframes soon...) as much as you want, and of course it does support uncapped partitions (zones). You are not limited to just 10 of them - I know environments with many more than 10 - and you can allocate 1/1000th of a CPU or less if you want; basically you don't have the artificial or design limits you have in AIX.
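Just to illustrate, capping a zone to a fraction of a CPU on Solaris looks roughly like this (a sketch from memory, zone name made up):

# zonecfg -z webzone
zonecfg:webzone> add capped-cpu
zonecfg:webzone:capped-cpu> set ncpus=0.125
zonecfg:webzone:capped-cpu> end
zonecfg:webzone> exit

And of course you can leave zones uncapped and just use FSS shares to divide CPU in whatever proportions you like.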

Then they move to networking and complain that in Solaris you have to edit text files... how funny is that? By the way, if you really want you can use Webmin, a GUI interface for managing an OS which is delivered with Solaris - much easier than IBM's SMIT, and it works on Linux and a couple of other platforms too... or you can use Visual Panels in Open Solaris, which is more powerful still.

Later on they move to performance tuning and claim that using gazillions of different AIX commands is somehow easier than managing two text files on Solaris... well... whatever. Of course performance tuning is all about observability, because you first need to understand what you want to tune and why, and then measure the effect your changes will introduce. Solaris is currently the most observable production OS on the planet, mostly due to DTrace. AIX is far, far behind in this respect.

Why is it so hard to find an article from IBM which at least tries to be objective? Why is everything they do part of the ultimate marketing machine? Maybe because it works in so many cases...

The plain truth is that AIX was one of the innovative UNIX flavours in the market, but it fell behind many years ago. And while they do try to catch up here and there, they no longer lead the market and it is a dying platform. If it weren't for a large legacy enterprise customer base it would already have shared the fate of OS/2. Sure, there are some enthusiasts - there always are for any product - but that doesn't change anything. The future seems to be with Linux, Open Solaris and Windows.

Thursday, May 28, 2009

Parallel Zones Patching in Solaris 10 U8

And if you want it before U8 shows up, it should be delivered as a patch to patchadd in late June.

update: looks like it will be released on 17.06.2009

Wednesday, May 27, 2009

L2ARC Turbo WarmUp

This has been integrated into snv_107.

"The L2ARC warms up at a maximum rate of l2arc_write_max per second,
which currently defaults to 8 Mbytes. This value was picked to minimise
the expense of both finding this data in the ARC, and writing it to current
read-bias SSDs - which maximises workload performance from warm L2ARC devices.

This value isn't ideal if the L2ARC devices are cold or cool - since we'd like
them to warm up as quick as possible. Also, since they are cold - there is
less concern for interfering with L2ARC hits by writing faster.
This also applies to shifts in workload, where the L2ARC is warm with content
that is no longer getting hits - and behaves as if it is cool.

This RFE is to dynamically increase the write rate when the current workload
isn't cached on the L2ARC (either because it is cold, or because the workload
has shifted); and to idle back the write rate when the L2ARC is warm."
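If you want to poke at the knobs involved, both the base rate and the warm-up boost are just kernel tunables - a sketch, assuming the variable names stay as in the ARC code (l2arc_write_max and l2arc_write_boost, both in bytes):

# echo l2arc_write_max/E | mdb -k
# echo l2arc_write_boost/E | mdb -k

If you really need to, you can override them via /etc/system (e.g. set zfs:l2arc_write_max=0x2000000), though with this RFE in place there should be little reason to.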

Friday, May 15, 2009

Is it really a random load?

I'm running some benchmarks in the background and for some reason I wanted to verify whether the workload filebench is generating is actually random within a large file. The file is 100GB in size, but the workload is supposed to do random reads only to the first 70GB of the file.

# dtrace -n io:::start'/args[2]->fi_name == "00000001"/ \
{@=lquantize(args[2]->fi_offset/(1024*1024*1024),0,100,10);}' \
-n tick-3s'{printa(@);}'
[...]
0 49035 :tick-3s

value ------------- Distribution ------------- count
0 | 0
0 |@@@@@@ 218788
10 |@@@@@@ 219156
20 |@@@@@@ 219233
30 |@@@@@@ 218420
40 |@@@@@@ 218628
50 |@@@@@@ 217932
60 |@@@@@@ 217572
70 | 0

So both things hold - the accesses are evenly distributed and confined to the first 70GB of the file.

Thursday, May 14, 2009

Open Storage Wish List

1. L2ARC should survive reboots

There is already an RFE for this and AFAIK it's being worked on.

2. Ability to mirror ARC between cluster nodes

For some workloads it is important to be able to sustain a given level of performance. Warming up a cache can take hours, during which time the delivered performance could be lower than usual. #1, once implemented, should partly fix the issue, but filling 128GB of cache could still take some time and negatively impact performance. I think the replication wouldn't necessarily have to be synchronous. It would probably be hard to implement and maybe it is not worth it...

3. L2ARC and SLOG SSDs shouldn't be included in disk drive IOPS stats.

While I haven't looked into it very closely it seems that when graphing IOPS for disk drives both L2ARC and SLOG numbers are included in totals. I think it is slightly misleading and a separate graph should be provided just for L2ARC and SLOG.

4. Ability to create a storage pool without L2ARC and/or SLOG devices even if they are present

While this is not necessarily important for production deployments it would help with testing/benchmarking so one doesn't have to physically remove SSDs in order to be able to build a pool without them.

Open Storage and Data Caching

I've been playing with the Open Storage 7410 recently. Although I've been using its GUI for quite some time thanks to the FishWorks beta program, it still amazes me how good it is, especially when you compare it to NetApp or Data Domain.

One of the really good things about Open Storage is that it allows for quite a lot of read/write cache (currently up to 128GB). If that is still not enough, you can add up to ~600GB of additional read cache in the form of SSDs. What it means in practice is that many real-life workloads will fit entirely into cache, which in turn provides excellent performance. In a way this is nothing new except for... economics! Try to find any other NAS product on the market where you can put in ~600GB of cache and stay within the same price range as Open Storage. You won't find anything like it.

I created a disk pool out of 20x 1TB SATA disk drives protected with RAID-DP (aka RAIDZ2, which is an implementation of RAID-6). Now RAIDZ2 is known for very bad random read performance from multiple streams if data is not cached. Using filebench I ran a random read workload on a 10GB dataset (let's say a small MySQL database) with 16 active streams. The 7410 appliance had been rebooted prior to the test, so all caches were clean. As you can see in the screenshot below, at the beginning it was able to sustain ~400 NFSv3 operations per second. After about 50 minutes it delivered ~12,000 NFSv3 operations per second, which saturated my 1GbE link. Over the same period the average latency of NFS operations was getting smaller and smaller, as was the number of operations to physical disks. At some point all the data was in cache and there were no operations to physical disks at all.



The appliance could certainly do much more if I used more GbE links or 10GbE links. Now remember that I used 20x 1TB SATA disk drives in a RAID-DP configuration to get this performance, and it could sustain it for workloads with a working set size of up to ~600GB. To put these numbers into perspective: one 15K FC disk drive can deliver at most ~250 8KB random reads per second. You would need almost 100 such disk drives configured in RAID-10 to match this performance, and you would still get less capacity (even assuming 300GB FC 15K drives).

Open Storage is a game changer for a lot of workloads, both in terms of delivered performance and cost - currently there isn't really anything on the market which can match it.


Tuesday, May 05, 2009

GNU grep vs. grep

Open Solaris b111 x64
$ ptime /usr/gnu/bin/grep mysqld 1_4GB_txt_file

real 1:32.055472017
user 1:30.202692546
sys 0.907308690

$ ptime /usr/bin/grep mysqld 1_4GB_txt_file

real 8.725173958
user 7.621411130
sys 1.056151347


I guess it's due to the GNU version being compiled without optimizations... or maybe it's something else. Once I find some time I'll try to investigate.

Thursday, April 30, 2009

Solaris 10 5/09

Solaris 10 5/09, aka update 7, is out and available for download. Below are some new features from the What's New document that I found interesting:

Support for Zone Cloning
If the source and the target zonepaths reside on ZFS and both are in the same pool, a snapshot of the source zonepath is taken and the zoneadm clone uses ZFS to clone the zone.
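In practice it boils down to something like the sketch below (zone names made up; when both zonepaths are in the same ZFS pool, the clone step takes a snapshot of the source zonepath instead of copying it):

# zonecfg -z zone1 export -f /tmp/zone2.cfg
(edit /tmp/zone2.cfg - at least the zonepath - before importing)
# zonecfg -z zone2 -f /tmp/zone2.cfg
# zoneadm -z zone2 clone zone1
# zoneadm -z zone2 boot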

SunSSH With OpenSSL PKCS#11 Engine Support
This feature enables the SunSSH server and client to use Solaris Cryptographic Framework through the OpenSSL PKCS#11 engine. SunSSH uses cryptographic framework for hardware crypto acceleration of symmetric crypto algorithms which is important to the data transfer speed. This feature is aimed at UltraSPARC® T2 processor platforms with n2cp(7D) crypto driver.

iSCSI Target
Several bug fixes and improvements. See the What's New for more details.

Solaris Power Aware Dispatcher and Deep C-State Support

■ Event driven CPU power management – On systems that support Dynamic Voltage and Frequency Scaling (DVFS) by Solaris, the kernel scheduler or dispatcher will schedule threads across the system's CPUs in a manner that coalesces load, and frees up other CPUs to be deeply power managed. CPU power state changes are triggered when the dispatcher recognizes that the utilization across a group of power manageable CPUs has changed in a significant way. This eliminates the need to periodically poll CPU utilizations across the system, and enables the system to save more power when CPUs are not used, while driving performance when CPUs are used. Event driven CPU power management is enabled by default on systems that support DVFS. This feature can be disabled, or the legacy polling-based CPU power management can be used, through the cpupm keyword in power.conf(4).


■ Support for Deep Idle CPU Power Management, or deep C-state support, on Intel Nehalem-based systems – The project also adds Solaris support for Deep C-states on Intel Nehalem-based systems. This support enables unused CPU resources to be dynamically placed in a state where they consume a fraction of the power consumed in their normal operating state. This feature also provides Solaris support for the power saving feature, as well as the policy implementation that decides when idle CPUs should request deep idle mode. This feature will be enabled by default where supported, and can be disabled through the cpu-deep-idle keyword in power.conf(4).


■ Observability for Intel's Turbo Mode feature – Intel Nehalem-based systems have the ability to raise the operating frequency of a subset of the available cores when there is enough thermal headroom to do so. This ability temporarily boosts performance, but it is controlled by the hardware and transparent to software. Starting with the Solaris 10 5/09 release, a new kstat module observes when the system is entering the turbo mode and at which frequency it operates.
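Both power management behaviours are driven from power.conf(4); a minimal sketch of the relevant keywords (see the man page for the polling-mode fallback and the other accepted values):

# grep -i -E 'cpupm|cpu-deep-idle' /etc/power.conf
cpupm enable
cpu-deep-idle default

followed by running pmconfig so the changes take effect.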

Wednesday, April 29, 2009

Some perspective on Sun's Q3 FY2009 numbers

Upgrading Open Storage 7410

Adam blogged about a new software release for Open Storage. Because I'm testing the 7410 model, I went through an upgrade this morning. First I downloaded the new image and I was impressed - I got 9-10MB/s, and it looks like the bottleneck was my 100Mb/s link to the office network - I never got such download rates from Sun before, excellent!
Then with just one mouse click I uploaded the new image onto the 7410, and after a short while it was listed as ready to be installed. So I clicked on it, and about 25 minutes later it had finished. Of course, during the upgrade I could still use the appliance.

I really like the end-user experience - a couple of mouse clicks and you're done. That's the way it should be.

I did some testing with filebench before and after the upgrade and I'm really happy to share that I'm getting about a 33% performance improvement for the varmail workload with the new build. While your mileage may vary, I think Sun should have highlighted the performance improvements in the Release Notes.


What would really be useful for testing purposes is the ability to create a pool without L2ARC or SLOG devices - this would make life a little bit easier when comparing configurations with and without SSDs. The default behaviour is excellent, as it will pick up the SSDs and propose the most optimal use of them, so the end user doesn't even have to understand how they work or how to configure them properly. Still, I would like to have the option of not configuring L2ARC or SLOG without having to physically pull the devices out.

Monday, April 20, 2009

Oracle Agrees to Acquire Sun Microsystems

This is a big surprise!

From Oracle's document on the acquisition:
• Protects and extends customers’ investment in Sun technologies
• Accelerate growth of Java as an open industry standard development platform
• Sustain Solaris as an industry standard OS for Oracle software
• Continue Open Storage and Systems focus and innovation
• Ensure continued innovation and investment in Java technology
• Optimize Solaris and Oracle for better performance, reliability, and manageability
• Protects massive customer investment in SPARC
• Open Storage built with industry standard servers and components

Sun's Official Announcement
Wall Street Journal

Monday, April 06, 2009

truss(1M) vs. dtrace(1M)

One of the many benefits of DTrace over truss is that DTrace should induce much smaller overhead when tracing applications, especially multi-threaded applications running on multi-core/CPU servers. Let's put it to a quick test.

I quickly wrote a small C program which spawns N threads, and each thread calls stat("/tmp") X times. Then I measured how long it takes to execute 1 million stat()'s in total while running with no tracing at all, running under truss, and running under dtrace.


One two-core AMD CPU
# ptime ./threads-2 1 1000000

real 2.662809885
user 0.223471401
sys 2.435895135

# ptime ./threads-2 2 500000

real 1.649542016
user 0.226104849
sys 3.045784378

# ptime truss -t xstat -c ./threads-2 2 500000

syscall seconds calls errors
xstat 6.966 1000000
stat64 .000 3 1
-------- ------ ----
sys totals: 6.966 1000003 1
usr time: .776
elapsed: 18.520

real 18.533000528
user 5.677239771
sys 16.069020190

# dtrace -n 'syscall::xstat:entry{@=count();}' -c 'ptime ./threads-2 2 500000'
dtrace: description 'syscall::xstat:entry' matched 1 probe

real 1.888294217
user 0.225676973
sys 3.506004575
dtrace: pid 8526 has exited

1000000

truss made the program execute about 11x longer, while dtrace made it execute about 14% longer.


Niagara server:

# ptime ./threads-2 1 1000000

real 10.873
user 1.881
sys 8.992

# ptime ./threads-2 10 100000

real 1.467
user 1.962
sys 12.121

# ptime truss -t xstat -c ./threads-2 1 1000000

syscall seconds calls errors
stat 26.958 1000004 1
-------- ------ ----
sys totals: 26.958 1000004 1
usr time: 2.758
elapsed: 214.600

real 3:34.613
user 30.900
sys 2:28.182

# ptime truss -t xstat -c ./threads-2 10 100000

syscall seconds calls errors
stat 37.259 1000004 1
-------- ------ ----
sys totals: 37.259 1000004 1
usr time: 3.178
elapsed: 168.010

real 2:48.063
user 1:05.709
sys 3:35.813

# dtrace -n 'syscall::stat:entry{@=count();}' -c 'ptime ./threads-2 1 1000000'
dtrace: description 'syscall::stat:entry' matched 1 probe

real 14.028
user 1.957
sys 12.069
dtrace: pid 12920 has exited

1000939

# dtrace -n 'syscall::stat:entry{@=count();}' -c 'ptime ./threads-2 10 100000'
dtrace: description 'syscall::stat:entry' matched 1 probe

real 1.858
user 2.142
sys 15.632
dtrace: pid 11679 has exited

1000083

truss made the program execute about 20x longer in the single-threaded case and 115x longer in the multi-threaded one, while dtrace added no more than 30% to the execution time regardless of whether the application was running with one or many threads. This shows that one has to be especially careful when using truss on a multi-threaded application on a multi-CPU/core system. Notice that for truss there is not that much difference between the multi-threaded and single-threaded runs, in contrast to the execution times with no tracing at all - which exposes the ugly property of truss: it serializes a multi-threaded application.

Of course this benchmark is a worst-case scenario, and in real life you shouldn't get that much overhead from either tool. Still, in some cases truss could introduce too much overhead on a production server while dtrace would still be perfectly acceptable, allowing you to continue with your investigation.

By the way, the DTraceToolkit provides a script called dtruss - a tool similar to truss, but built on DTrace.
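For example, a rough equivalent of the truss run above would be something along these lines (a sketch - dtruss options vary a little between DTraceToolkit versions, so check dtruss -h first):

# dtruss -c -t stat ./threads-2 10 100000

which counts the stat() calls via DTrace instead of stopping the victim process at every syscall the way truss does.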



cat threads-2.c


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* each thread calls stat("/tmp") N times */
void *thread_func(void *arg)
{
    int *N = arg;
    int i;
    struct stat buf;

    for (i = 0; i < *N; i++)
        stat("/tmp", &buf);

    return (0);
}

int main(int argc, char **argv)
{
    int N, iter;
    int i;
    int rc;
    pthread_t tid[255];

    if (argc != 3)
    {
        printf("%s number_of_threads number_of_iterations_per_thread\n", argv[0]);
        exit(1);
    }

    N = atoi(argv[1]);
    iter = atoi(argv[2]);

    /* spawn the worker threads */
    for (i = 0; i < N; i++)
    {
        if ((rc = pthread_create(&tid[i], NULL, thread_func, &iter)) != 0)
            printf("Thread #%d creation failed [%d]\n", i, rc);
    }

    /* wait for all threads to complete */
    for (i = 0; i < N; i++)
        pthread_join(tid[i], NULL);

    exit(0);
}

Tuesday, March 31, 2009

ZFS Deduplication This Summer?

Jeff Bonwick wrote:
"Yes -- dedup is my (and Bill's) current project. Prototyped in December.
Integration this summer. I'll blog all the details when we integrate,
but it's what you'd expect of ZFS dedup -- synchronous, no limits, etc."

The CPU Overclocks itself

Joerg reports:
"With the announcement of Intel Nehalem support in Solaris, we pointed to some interesting features, but from my perspective the power-aware dispatcher is the most interesting one. I wrote a while ago about the turbo boost feature of the Nehalem processors. The processor overclocks itself, when there is still head room in the power and thermal budget. It can overclock a core even higher, when other cores are in deep sleep. Otherwise it can make sense not to use a core for a single process, when there is enough compute power available otherwise you could put this core into a deep sleep mode just to save power. The new power-aware dispatcher in Solaris is aware of this side conditions and can dispatch the processes in a System accordingly. You will find more informations at the projects website."

Thursday, March 26, 2009

Trying too hard

From time to time I see people trying to be too clever about some problems. What I mean is that sometimes they try too hard to use the latest technologies to do something when there is already a solution which does the job. Or sometimes, instead of taking a step back and taking a deep breath, they dive directly into problem solving and come up with crazy ways to accomplish something. I guess it happens to all of us from time to time. This time it happened to me :) :)

A colleague approached me with a problem he had on some old Solaris 7 server which is stripped down and customized, and there is no pargs command there. He needed to get the full argument list of a running process, but ps truncates it to 80 characters. Well, I thought a simple C program should be able to extract the information via /proc. So, trying to be helpful, I started writing it right away. After some time I came up with:


bash-2.05# cat pargs.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <procfs.h>

int main(int argc, char *argv[])
{
    psinfo_t p;
    char *file;
    int fd;
    int fd_as;
    uintptr_t pargv[1024];
    char arg[1024];
    int i;

    if (argc != 3)
    {
        printf("Usage: %s /proc/PID/psinfo /proc/PID/as\n", argv[0]);
        exit(1);
    }

    /* read the psinfo structure - it holds argc and a pointer to argv */
    file = argv[1];
    fd = open(file, O_RDONLY);
    if (fd == -1)
    {
        printf("Can't open %s file\n", file);
        exit(2);
    }

    read(fd, &p, sizeof (p));
    close(fd);

    /* the as (address space) file lets us read the argument strings themselves */
    fd_as = open(argv[2], O_RDONLY);

    printf("nlwp: %d\n", p.pr_nlwp);
    printf("exec: %s\n", p.pr_fname);
    printf("args: %s\n", p.pr_psargs);
    printf("argc: %d\n", p.pr_argc);

    /* fetch the argv pointer array, then each argument string */
    pread(fd_as, &pargv, p.pr_argc * sizeof (uintptr_t), p.pr_argv);
    for (i = 0; i < p.pr_argc; i++)
    {
        pread(fd_as, &arg, sizeof (arg), pargv[i]);
        printf(" %s\n", arg);
    }

    close(fd_as);
    exit(0);
}



Job done.
Well, a couple of minutes later I realized that the UCB version of ps is able to show the long argument list...


bash-2.05# /usr/ucb/ps -axuww |grep "19179"
XXXX 19179 9.3 2.23998422056 ? S 11:02:30 0:02 /usr/java/bin/../bin/sparc/native_threads/java -classpath :./classes/packages/jakarta-regexp-1.3.jar:./classes/packages/classes12.zip:./classes/packages/mail.jar:./classes/packages/activation.jar:./classes/ MailSender
bash-2.05#


I had a good laugh at myself.

Tuesday, March 24, 2009

Library Interposer

Recently I used DTrace to change the output of the uname() syscall. But if one wants a more permanent and selective approach, it is easier to write a small library which interposes on uname() (well, actually on the uname() libc function and not the syscall itself). I slightly modified the malloc_interposer example.

After you compile the library, all you have to do is LD_PRELOAD it in your script so that everything started by that script uses it, or you can LD_PRELOAD it only for a given binary, as shown below. Additionally, you have to set the environment variable uname_release to whatever string you like, otherwise the library won't do anything.

# uname -a
SunOS test-server 5.10 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V440
#
# uname_release="5.7" LD_PRELOAD=./uname_interposer.so uname -a
SunOS test-server 5.7 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V440



# cat uname_interposer.c
/* Based on http://developers.sun.com/solaris/articles/lib_interposers_code.html#malloc_interposer.c
 *
 * Example of a library interposer: interpose on uname().
 * Build and use this interposer as follows:
 *   cc -o uname_interposer.so -G -Kpic uname_interposer.c
 *   setenv LD_PRELOAD $cwd/uname_interposer.so
 *   run the app
 *   unsetenv LD_PRELOAD
 */

#include <stdio.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

#include <sys/utsname.h>

int uname(struct utsname *name)
{
        int rc;
        char *release;

        /* look up the real uname() the first time we are called */
        static int (*uname_func)(struct utsname *) = NULL;
        if (!uname_func)
                uname_func = (int (*)(struct utsname *)) dlsym(RTLD_NEXT, "uname");

        rc = uname_func(name);

        /* overwrite the release string if uname_release is set */
        if ((release = getenv("uname_release")) != NULL)
                strlcpy(name->release, release, _SYS_NMLN);

        return (rc);
}
#

# gcc -fPIC -g -o uname_interposer.so -G uname_interposer.c

Data Center in a Desert?

A friend asked me why anyone would build a mega data center in a desert. Well, when you put in a lot of CoolThreads servers you need to somehow keep them warm :) :) :)

Thursday, March 12, 2009

When Free is Too Expensive

I like Jonathan Schwartz's blog entries, and in his latest post he clarifies Sun's business model. I like the funny part about free software - how true it is.
"When Free is Too Expensive
One of my favorite customer stories relates to an American company that did nearly 30% of its yearly revenue on Christmas Day. They were a mobile phone company, whose handsets appeared under Christmas trees, opened en masse and provisioned on the internet within about a 48 hour period. When we won the bid to supply their datacenter, their CIO gave me the purchase order on the condition I gave him my home phone number. He said, "If I have any issues on Christmas, I want you on the phone making sure every resource available is solving the problem." I happily provided it (and then made sure I had my direct staff's home numbers). Christmas came and went, no problems at all.

A year later, he was issuing a purchase order to Sun for several of our software products. To have a little fun with him (and the Sun sales rep), I told him before he passed me the purchase order that the products were all open source, freely available for download.

He looked at me, then at his rep, and said "What? Then why am I paying you a million dollars?" I responded, "You can absolutely run it for free. You just can't call me on Christmas day, you'll be on your own." He gave me the PO. At the scale he was running, the cost of downtime dwarfed the cost of the license and support.

Numerically, most developers and technology users have more time than money. Most readers of this blog are happy to run unsupported software, and we are very happy to supply it. For a far smaller population, the price of downtime radically exceeds the price of a license or support - for some, the cost of downtime is measured in millions per minute. If you're tracking packages or fleets of aircraft, running an emergency response networking or a trading floor, you almost always have more money than time. And that's our business model, we offer utterly exceptional service, support and enterprise technologies to those that have more money than time. It's a good business."

Saturday, March 07, 2009

Open Storage - What's Next?

If you wonder what's coming in the storage area, and in ZFS in particular, watch the Open Solaris Storage Summit. To get your attention, here is a list of some really exciting features coming to ZFS:

  • DeDuplication in ZFS
  • User Quotas in ZFS
  • Disk Eviction/Pool Shrinking
  • VSS Shadow Copies with ZFS Snapshots
  • Persistent L2ARC
  • ZFS Encryption
  • Lustre + ZFS
  • pNFS + ZFS

Wednesday, March 04, 2009

Oracle 8.0.6 on Solaris 10

I'm working on migrating Oracle 8.0.6 32-bit from Solaris 7 to Solaris 10. There is no branded zone for Solaris 7, so we decided to try running Oracle 8.0.6 directly on Solaris 10. Basically it just works. Basically... the problem was that some of the database files are larger than 2GB and Oracle fails to recover the database from these files. After checking some log files and a little bit of dtrace'ing I found out that it does a stat() syscall on each db file before recovery starts, and stat() fails with EOVERFLOW. So it uses the wrong API... yet it seems to work fine on Solaris 7 with the same binaries. It turned out that while Oracle is starting it calls uname() to determine the OS version, and based on that information it can change its behavior (like not using the proper API to access large files). The easiest way around it is to use dtrace to intercept the uname() syscall and put a fake release string into its output just before it returns. After that everything seems to be working fine.
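To illustrate why faking the release string changes anything, here is a purely hypothetical sketch of the kind of run-time check that can cause this behavior (this is not Oracle's actual code, just an illustration): the application asks uname() for the release and picks a code path based on it, so lying about the release steers it onto the path that happens to work.

/* Hypothetical illustration only - not Oracle's actual code.
 * An application that selects a code path based on uname() output
 * will change its behavior if the reported release is faked. */
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

int main(void)
{
        struct utsname un;

        if (uname(&un) < 0) {
                perror("uname");
                return 1;
        }
        if (strcmp(un.release, "5.7") == 0)
                printf("release 5.7 - taking the legacy code path\n");
        else
                printf("release %s - taking a different code path\n", un.release);
        return 0;
}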

The dtrace script below will put the string "5.7" into the uname() output structure for every application calling uname() with uid=300 (oracle in my case). One might also write a small interposing library and LD_PRELOAD it while starting Oracle - that should also work.

#!/usr/sbin/dtrace -qs

#pragma D option destructive

/* remember where the application wants the utsname structure written */
syscall::uname:entry
/uid == 300/
{
        self->addr = arg0;
}

/* overwrite the release field (third 257-byte member of struct utsname)
 * after the real syscall has filled the structure in */
syscall::uname:return
/self->addr/
{
        copyoutstr("5.7", self->addr + (257 * 2), 257);
        self->addr = 0;
}

Tuesday, February 24, 2009

ZFS in the Trenches

Very good presentation by Ben Rockwood - an excellent one to start with for ZFS tuning.

Monday, February 23, 2009

BART on large file systems

I wanted to use bart(1M) to quickly compare the contents of two file systems. But it didn't work...

# bart create /some/filesystem/
! Version 1.0
! Monday, February 23, 2009 (11:53:57)
# Format:
#fname D size mode acl dirmtime uid gid
#fname P size mode acl mtime uid gid
#fname S size mode acl mtime uid gid
#fname F size mode acl mtime uid gid contents
#fname L size mode acl lnmtime uid gid dest
#fname B size mode acl mtime uid gid devnode
#fname C size mode acl mtime uid gid devnode
#

And it simply exits.

# truss bart create /some/filesystem/
[...]
statvfs("/some/filesystem", 0x08086E48) Err#79 EOVERFLOW
[...]
#

It should probably be using statvfs64(), but let's check what's going on.

# cat statvfs.d
#!/usr/sbin/dtrace -Fs

/* enable tracing only while bart is inside statvfs() */
syscall::statvfs:entry
/execname == "bart"/
{
        self->on = 1;
}

syscall::statvfs:return
/self->on/
{
        self->on = 0;
}

/* show every kernel function called on behalf of statvfs() ... */
fbt:::entry
/self->on/
{
}

/* ... and its return value */
fbt:::return
/self->on/
{
        trace(arg1);
}

# ./statvfs.d >/tmp/a &
# bart create /some/filesystem/
# cat /tmp/a
CPU FUNCTION
2 -> statvfs32
[...]
2 -> cstatvfs32
2 -> fsop_statfs
[...]
  2 <- fsop_statfs     0
  2 <- cstatvfs32      79
[...]
  2 <- statvfs32       79
#

It should have used statvfs64(), which would have gone through cstatvfs64_32() instead.

It seems /usr/bin/bart is a 32-bit binary compiled without largefile(5) awareness. There is a bug open for exactly this case already - 6436517.
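To show what largefile(5) awareness means in practice, here is a minimal sketch (the file name and paths are just examples): a 32-bit program calling statvfs() gets EOVERFLOW on a sufficiently large file system unless it is built in the largefile compilation environment described in lfcompile(5), which transparently maps statvfs() to statvfs64().

/* Minimal sketch of the difference largefile(5) awareness makes for a
 * 32-bit binary.  Compile it twice and point it at a very large
 * file system:
 *
 *   cc -o statvfs32 statvfs_test.c
 *   cc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -o statvfs64 statvfs_test.c
 *
 * The first build may fail with EOVERFLOW on a multi-terabyte file
 * system; in the second build statvfs() is transparently mapped to
 * statvfs64() and succeeds. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
        struct statvfs sv;

        if (argc != 2) {
                fprintf(stderr, "Usage: %s /some/filesystem\n", argv[0]);
                return 1;
        }
        if (statvfs(argv[1], &sv) != 0) {
                perror("statvfs");     /* EOVERFLOW without largefile support */
                return 2;
        }
        printf("block size: %lu, blocks: %llu\n",
            (unsigned long)sv.f_frsize, (unsigned long long)sv.f_blocks);
        return 0;
}

The recommended flags can be obtained from getconf LFS_CFLAGS; presumably rebuilding bart in that environment (or making it call statvfs64() directly) would fix the bug.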

While DTrace wasn't really necessary here, it helped me very quickly see what is actually going on in the kernel and why it fails, especially with a glance at the source thanks to the OpenSolaris OpenGrok. It helped to find the bug.

PS. The workaround in my case was to temporarily set a ZFS quota of 1TB on /some/filesystem - with the quota in place the values returned by statvfs() fit into 32-bit fields, and then it works.