Wednesday, December 20, 2006

Solaris 10 users worldwide

See this map - cool. From Jonathan's blog entry:
Each pink dot represents a connected Solaris 10 user - not a downloader, but an individual or machine (independent of who made the server) that connects back to Sun's free update service for revisions and patches - applied to an individual machine, or a global datacenter. This doesn't yet account for anywhere near all Solaris 10 downloads, as most administrators still choose to manage their updates through legacy, non-connected tools. But it's directionally interesting - and shows the value of leveraging the internet to meet customers (new and old).

Sun Cluster 3.2

Sun Cluster 3.2 is finally out. You can download it here for free. Just to highlight it - SC 3.2 supports ZFS, so you can, for example, build HA-NFS with ZFS and it works like a charm - I've been running such configs for months now (with the SC 3.2 beta). Also, I know people generally don't like to learn new CLIs, but in the case of the new SC it's worth it - IMHO it's much nicer. Additionally, thanks to the Quorum Server it's now possible to set up a cluster without shared storage - that could be useful sometimes. Documentation is available here.

NEW FEATURES

Ease of Use
* New Command Line Interfaces
* Oracle 10g improved integration and administration
* Agent configuration wizards
* Flexible IP address scheme

Higher Availability
* Cluster support for SMF services
* Quorum server
* Extended flexibility for fencing protocol

Greater Flexibility
* Expanded support for Solaris Containers
* HA ZFS - agent support for Sun's new file system
* Extended support for Veritas software components

Better Operations and Administration
* Dual-partition software update
* Live upgrade
* Optional GUI installation

With Solaris Cluster Geographic Edition, new features include:
* Support for x64 platforms
* Support for EMC SRDF replication software

Solaris Cluster is supported on Solaris 9 9/05 and Solaris 10 11/06.

Tuesday, December 19, 2006

Sun Download Manager


Recently I've noticed in the Sun Download Center that I can download files the old way, using Save As in a browser, or use Sun Download Manager launched directly from the web page via Java Web Start. I tried it and I must say I really like it - you just check which files you want to download, start SDM (from the web page), and the files immediately start downloading. It offers retries, resuming of partially downloaded files, automatic unzipping of zipped files, and proxy server support. All of it is configurable, of course.

However I have my wish list for SDM:

  • ability to download files in parallel (configurable how many streams)
  • ability to not only unzip files but also to automatically merge them (great for Solaris and/or SX downloads)
  • option to ask for download directory when new downloads are being added

Saturday, December 16, 2006

LISA - follow up

It was my first time at the LISA conference and I must say I really enjoyed it. There were a lot of people (reportedly over 1,100), and almost all the sessions I attended were really good. Not all of them were strictly technical, but they were both humorous and informative. I also had the opportunity to talk to other admins from large data centers, which is always great: you can see what other smart people are doing in their environments, often much larger than yours, and compare it to what you are doing. It's always good to hear what other smart people have to say. I hope I'll go to LISA next year :)

So after my short vacation and attending LISA I'm full of energy :) Well, Andrzej and I decided to start thinking about the next Unix Days. I guess I'll write something more about it later.

Availability Suite goes into Open Solaris

I was going through several OpenSolaris mailing lists and spotted really great news on the storage-discuss list - the entire Availability Suite is going to be integrated into Open Solaris next month! That means it will be free of charge, the source will be available, etc. For most people it means a mature solution for remote replication (synchronous and asynchronous) at the block level. The post is quoted below:

"[...]
As the Availability Suite Project & Technical Lead, I will take this
opportunity to say that in January '07, all of the Sun StorageTech
Availability Suite (AVS) software is going into OpenSolaris!

This will include both the Remote Mirror (SNDR) and Point-in-Time Copy
(II) software, which runs on OpenSolaris supported hardware platforms of
SPARC, x86 and x64.

AVS, being both file system and storage agnostic, makes AVS very capable
of replicating and/or taking snapshots of UFS, QFS, VxFS, ZFS, Solaris
support databases (Oracle, Sybase, etc.), contained on any of the
following types of storage: LUNs, SVM & VxVM volumes, lofi devices, even
ZFS's zvols. [...]"

"[...]
The SNDR portion of Availability Suite, is very capable of replicating
ZFS. Due to the nature of ZFS itself, the unit of replication or
snapshot is a ZFS storage pool, not a ZFS file system. The relationship
between the number of file systems in each storage pools is left to the
discretion of the system administrator, being 1-to-1 (like older file
systems), or many-to-1 (as is now possible with ZFS).

SNDR can replicate any number of ZFS storage pools, where each of the
vdevs in the storage pool (zpool status ), must be configured
under a single SNDR I/O consistency group. Once configured, the
replication of ZFS, like all other Solaris supported file systems, works
with both synchronous and asynchronous replication, the latter using
either memory queues or disks queues.

This product set is well documented and can seen at
http://docs.sun.com/app/docs?p=coll%2FAVS4.0
The current release notes for AVS 4.0 are located at
http://docs.sun.com/source/819-6152-10/AVS_40_Release_Notes.html

More details will be forthcoming in January, so please keep a look out
for Sun StorageTech Availability Suite in 2007![...]"


Entire thread here.

Monday, December 11, 2006

Solaris 10 11/06 (update 3)

The long-awaited update 3 of Solaris 10 is finally here. I've just put it on an x4500 box.

Friday, November 17, 2006

Vacation

I'm finally leaving for vacation :) Then directly from my vacation I go to LISA Tech Days, so see you there.

Thursday, November 16, 2006

ZFS RAID-Z2 Performance

While ZFS's RAID-Z2 can actually offer worse random read performance than HW RAID-5, it should offer much better write performance than HW RAID-5, especially when you are doing random writes or writing to a lot of different files concurrently. After doing some tests I was happy to find that it works exactly as expected. Now the hard question was: would RAID-Z2 be good enough, performance-wise, in an actual production environment? There's no simple answer, as in production we actually see a mix of reads and writes. With HW RAID-5, when your write throughput is large enough its write cache can't help much, and your write performance drops dramatically with random writes. Also, one write IO to the array is converted into several IOs - so you have fewer IOs left for reads. ZFS RAID-Z and RAID-Z2 don't behave that way and give you excellent write performance whether it's random or not. They should also generate fewer write IOs per disk than HW RAID-5. So the true question is: will that offset be enough to get better overall performance in production?

After some testing I wasn't really any closer to answering that question - so I settled on a pool configuration and other details and decided to put it in production. The business comparison is that I need at least two HW RAID-5 arrays to carry our production traffic. One array just can't do it, and the main problem is writes. Well, a single x4500 with RAID-Z2 seems to be doing the job in the same environment without any problems - at least so far. It'll be interesting to see how it behaves with more and more data on it (only a few TBs right now), as that will also mean more reads. But from what I've seen so far I'm optimistic.

Tuesday, November 14, 2006

Caiman

If you install Solaris on servers using JumpStart then you never actually see the Solaris interactive installer. But more and more people are using Solaris on their desktops and laptops, and often the installer is their first contact with Solaris. And I must admit it's not a good one. Fortunately Sun realizes that, and some time ago the Caiman project was started to address this problem. See the Caiman Architecture document and the Install Strategy document. Also see early proposals for the Caiman GUI installer.

Friday, November 10, 2006

ZFS tuning

Recently 6472021 was integrated. If you want to tune ZFS, here you can get a list of tunables. Some default values for the tunables, with short comments, can be found here, here, and here.
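
For example, one commonly mentioned tunable can be set in /etc/system and takes effect after a reboot - this is just an illustrative sketch, the value is made up and you should check the lists linked above for what actually applies to your build:

* cap the ZFS ARC at 4GB (example value only)
set zfs:zfs_arc_max = 0x100000000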

St Paul Blade - Niagara blade from Sun in Q1/07

I was looking through the latest changes to Open Solaris and found this:


Date: Mon, 30 Oct 2006 19:45:33 -0800
From: Venkat Kondaveeti
To: onnv-gate at onnv dot sfbay dot sun dot com, on-all at sun dot com
Subject: Heads-up:St Paul platform support in Nevada

Today's putback for the following
PSARC 2006/575 St Paul Platform Software Support
6472061 Solaris support for St Paul platform

provides the St Paul Blade platform support in Nevada.
uname -i O/P for St Paul platform is SUNW,Sun-Blade-T6300.

The CRs aganist Solaris for St Paul Blade platform support
should be filed under platform-sw/stpaul/solaris-kernel in bugster.

If you're changing sun4v or Fire code, you'll want to test on St Paul.
You can get hold of one by contacting stpaul_sw at sun dot com alias with
"Subject: Need St Paul System Access" and blades
will be delivered to ON PIT and ON Dev on or about Feb'8th,2007.

St Paul eng team will provide the technical support.
Please send email to stpaul_sw at sun dot com if any issues.

FYI, StPaul is a Niagara-1 based, 1P, blade server designed exclusively
for use in the Constellation chassis (C-10). The blades are comprised of
an enclosed motherboard that hosts 1 system processor, 1 FIRE ASIC, 8
DIMMS,
4 disks, 2 10/100/1000Mbps Ethernet ports, 2 USB 2.0 ports and a Service
processor. Power supplies, fans and IO slots do not reside on the blade,
but instead exist as part of the C-10 chassis. Much of the blade design is
highly leveraged from the Ontario platform. St Paul RR date per plan is
03/2007.

Thanks

St Paul SW Development Team

Wednesday, November 08, 2006

ZFS saved our data

Recently we migrated a Linux NFS server to a Solaris 10 NFS server with Sun Cluster 3.2 and ZFS. The system has two SCSI JBODs attached, each node has two SCSI adapters, and a RAID-10 spanning the JBODs and SCSI adapters was created using ZFS. We used rsync to migrate the data. During the migration we noticed in the system logs that one of the SCSI adapters reported some warnings from time to time. Then came more serious warnings about bad firmware or a broken adapter - but data kept being written. When we ran rsync again, ZFS reported some checksum errors, but only on the disks connected to the bad adapter. I ran a scrub on the entire pool and ZFS reported and corrected thousands of checksum errors - all of them on the bad controller. We removed the bad controller, reconnected the JBOD to the good one, and ran a scrub again - this time no errors. Then we completed the data migration. So far everything works fine and no checksum errors are reported by ZFS.

The important thing here is that ZFS detected that the bad SCSI adapter was actually corrupting data and was able to correct it on the fly, so we didn't have to start from the beginning. Also, if it had been a classic file system we probably wouldn't even have noticed that our data was corrupted until a system panic or an fsck was needed. And with so many errors, fsck probably wouldn't have restored file system consistency anyway, not to mention that it wouldn't correct the bad data at all.
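
For reference, the whole check-and-repair cycle boils down to two commands (the pool name here is hypothetical):

# zpool scrub mypool
# zpool status -v mypool

The status output shows the per-device read/write/checksum error counters, which is how we tracked the errors down to the bad controller.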

Friday, November 03, 2006

Thumper throughput

For some testing I'm creating 8 RAID-5 devices under SVM with a 128k interleave size right now. It's really amazing how much throughput the x4500 server can deliver. Right now all those RAID-5 volumes are generating over 2GB/s of write throughput! Woooha! It can write more data to disks than most (all?) Intel servers can read or write to memory :))))


bash-3.00# metainit d101 -r c0t0d0s0 c1t0d0s0 c4t0d0s0 c6t0d0s0 c7t0d0s0 -i 128k
d101: RAID is setup
bash-3.00# metainit d102 -r c0t1d0s0 c1t1d0s0 c5t1d0s0 c6t1d0s0 c7t1d0s0 -i 128k
d102: RAID is setup
bash-3.00# metainit d103 -r c0t2d0s0 c1t2d0s0 c5t2d0s0 c6t2d0s0 c7t2d0s0 -i 128k
d103: RAID is setup
bash-3.00# metainit d104 -r c0t4d0s0 c1t4d0s0 c4t4d0s0 c6t4d0s0 c7t4d0s0 -i 128k
d104: RAID is setup
bash-3.00# metainit d105 -r c0t3d0s0 c1t3d0s0 c4t3d0s0 c5t3d0s0 c6t3d0s0 c7t3d0s0 -i 128k
d105: RAID is setup
bash-3.00# metainit d106 -r c0t5d0s0 c1t5d0s0 c4t5d0s0 c5t5d0s0 c6t5d0s0 c7t5d0s0 -i 128k
d106: RAID is setup
bash-3.00# metainit d107 -r c0t6d0s0 c1t6d0s0 c4t6d0s0 c5t6d0s0 c6t6d0s0 c7t6d0s0 -i 128k
d107: RAID is setup
bash-3.00# metainit d108 -r c0t7d0s0 c1t7d0s0 c4t7d0s0 c5t7d0s0 c6t7d0s0 c7t7d0s0 -i 128k
d108: RAID is setup
bash-3.00#


bash-3.00# iostat -xnzCM 1 | egrep "device| c[0-7]$"
[omitted first output as it's avarage since reboot]
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 367.5 0.0 367.5 0.0 8.0 0.0 21.7 0 798 c0
0.0 389.5 0.0 389.5 0.0 8.0 0.0 20.5 0 798 c1
0.0 276.4 0.0 276.4 0.0 6.0 0.0 21.7 0 599 c4
5.0 258.4 0.0 258.4 0.0 6.0 0.0 22.9 0 602 c5
0.0 394.5 0.0 394.5 0.0 8.0 0.0 20.2 0 798 c6
0.0 396.5 0.0 396.5 0.0 8.0 0.0 20.1 0 798 c7
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 376.0 0.0 376.0 0.0 8.0 0.0 21.2 0 798 c0
0.0 390.0 0.0 390.0 0.0 8.0 0.0 20.5 0 798 c1
0.0 281.0 0.0 281.0 0.0 6.0 0.0 21.3 0 599 c4
0.0 250.0 0.0 250.0 0.0 6.0 0.0 24.0 0 599 c5
0.0 392.0 0.0 392.0 0.0 8.0 0.0 20.4 0 798 c6
0.0 386.0 0.0 386.0 0.0 8.0 0.0 20.7 0 798 c7
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 375.0 0.0 375.0 0.0 8.0 0.0 21.3 0 798 c0
0.0 407.0 0.0 407.0 0.0 8.0 0.0 19.6 0 798 c1
0.0 275.0 0.0 275.0 0.0 6.0 0.0 21.8 0 599 c4
0.0 247.0 0.0 247.0 0.0 6.0 0.0 24.2 0 599 c5
0.0 388.0 0.0 388.0 0.0 8.0 0.0 20.6 0 798 c6
0.0 382.0 0.0 382.0 0.0 8.0 0.0 20.9 0 798 c7
^C
bash-3.00# bc
376.0+390.0+281.0+250.0+392.0+386.0
2075.0

Monday, October 30, 2006

Thumpers - first impression

The first Thumper has been on for about 30 minutes :) There's S10 06/06 and 01/06 pre-installed. I chose 06/06 to boot, answered the basic questions from sysidtool (locale, terminal type, etc.) - all on the serial console by default - great. After the reboot I logged in on the console and - 2x 285 Opterons, 16GB RAM, 48 disks :) df -h and... by default there's already one pool configured, named zpool1, which is a collection of many smaller raidz groups. Well, the first thing I just had to do was run a simple dd on that pool - 600-800MB/s of write performance and similar when reading :)
Then I recreated the pool using the same disks as one large stripe and got 1.35GB/s with a single dd :) Not bad :)))))

Thumpers arrived!

Finally!

ps. Sun should definitely do something about their logistics - lately it just doesn't work the way it should.

Jim Mauro in Poland

This news is mostly for people in Poland.

On Friday, November 17th, between 10:00 and 12:30, Jim will give the following presentation at the Sun Microsystems office in Warsaw, ul. Hankiewicza 2:

Solaris 10 Performance, Observability and Debugging Tools and Techniques in Solaris 10

The presentation is aimed at system administrators, experts in the Solaris operating system and performance topics, programmers, and application developers.

Note - registration via e-mail is required. Details are in the official invitation.
You can find the official invitation here.

Jim Mauro is a co-author of the first Solaris Internals book as well as of the two newest ones, Solaris Internals Second Edition and Solaris Performance and Tools - anyone seriously involved in system administration and/or application and system tuning should have these books on their shelf.

Information about the presentation is also on the blog of Robert Prus, who is a co-organizer of the meeting.


ps. according to unofficial information, the books mentioned above may be raffled off among the attendees
ps2. I don't know yet whether I'll be there

Tuesday, October 24, 2006

IP Instances

Sometimes it would be really useful to have a separate routing table for a zone, or to be able to easily dedicate a whole NIC to a zone (without assigning an IP address or routing in the global zone). Well, there's actually a project called IP Instances which is going to address all these issues and more. If you want to know more about it, read here and here.

Tuesday, October 17, 2006

Dell RAID controllers

I've been playing with a Dell PERC 3/DC RAID controller in a v20z server + ZFS. There's an external JBOD with several disks connected on one channel. I decided to configure each disk as a RAID-0, so basically I got a 1-1 mapping. Then, using ZFS, I created a pool in a RAID-10 config. It turned out that creating the RAID-10 entirely in HW doesn't offer any better performance, so doing it in ZFS has two main advantages: first, I get much better data availability thanks to ZFS checksumming, and second, I can connect that JBOD to a simple SCSI controller and just import the ZFS pool. I actually verified this by pulling out a disk with a ZFS pool on it, putting the disk in another JBOD attached to a SPARC server with a simple SCSI adapter, and I was able to import the pool without any problems.

Now performance - it turned out that I get the best performance if I disable Read-Ahead and set Write-Thru mode with the stripe size set to 128k. Any more logic enabled on the controller and performance drops. So in essence I treat that RAID controller as a SCSI card (I would use a SCSI card but I don't have a free one right now).

By default Solaris recognizes the PERC 3/DC controller out of the box. However, you can't change any RAID parameters from within Solaris - you have to reboot the server and go into the RAID BIOS, which is really annoying, especially if you want to tweak parameters and see how performance changes - it just takes a lot of time. Additionally, with the setup described above (1-1 mapping, no redundancy in HW), if one disk fails you can replace it with another one (I dedicated one disk as a hot spare), but you can't replace the failed disk online, because in such a config you have to go into the RAID BIOS and tell it you have replaced the disk. There's however another solution - Solaris uses the amr(7D) driver by default with the PERC 3/DC. If you tell Solaris to use the lsimega(7D) driver instead (which is also delivered with Solaris) then you can use the LSIutil package, which gives you monitoring and, most importantly, the ability to reconfigure the RAID card on the fly from within Solaris. The utility just works.

All I had to do in order to force the system to use the lsimega driver instead of amr was to comment out the line 'amr "pci1028,493"' in /etc/driver_aliases, add a line 'lsimega "pci1028,493"', and then reboot the server (with reconfiguration). The hard part was finding the LSIutil package - thanks to Mike Riley I downloaded it from the LSI-Sun web page.
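
For the record, a rough sketch of the change (the alias string should match whatever is already bound to amr on your box):

# grep 'pci1028,493' /etc/driver_aliases
amr "pci1028,493"
(edit the file: comment out the amr line and add: lsimega "pci1028,493")
# reboot -- -r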

ps. it's great that Solaris has more and more drivers out of the box. I believe it would be even better if Sun also included utilities like LSIutil.

Project Blackbox



According to a story on The Register it holds 250 Niagara or x64 boxes and is cooled with water. There's also a story on CNET.

update: it's official.

Friday, October 13, 2006

SecondLife on Solaris

Sun recently held a conference in Second Life. Unfortunately there's no Solaris client - only Windows, Mac and Linux. Well, snv_49 has BrandZ, right? I downloaded the Linux client, logged into a Linux zone, and just started it - it works very well and very fast! So, all Solaris users - if you want to join Second Life, go BrandZ :)

You can find Conference location here.

ps. Definitely forward X over TCP to your X server in the global zone rather than using ssh X11 forwarding.
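
A rough sketch of what I mean (the addresses are just examples, and your X server in the global zone must be listening on TCP, i.e. not started with -nolisten tcp):

globalzone$ xhost +192.168.1.10            allow the Linux zone's address
linuxzone$ export DISPLAY=192.168.1.1:0    point at the global zone's display
linuxzone$ ./secondlife                    or however the Linux client is started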

Thursday, October 12, 2006

BrandZ integrated into snv_49

Hi.

The BrandZ project has finally been integrated into snv_49. For those of you who don't know, it means you can install a Linux distribution in a Solaris Zone and run Linux applications. Those Linux applications run under the Solaris kernel, which means you can use Solaris Resource Manager, DTrace, ZFS, etc. For more details see the BrandZ Overview Presentation. BrandZ is expected to be in Solaris 10 Update 4 next year. Right now you can get it with Solaris Express Community Edition and soon with Solaris Express.

Is it hard to install Linux in a zone? Well, below you can see what I did - create a Linux zone with networking and an audio device present for Linux. Basically it's just two commands!

# zonecfg -z linux
linux: No such zone configured

Use 'create' to begin configuring a new zone.

zonecfg:linux> create -t SUNWlx

zonecfg:linux> set zonepath=/home/zones/linux

zonecfg:linux> add net

zonecfg:linux:net> set address=192.168.1.10/24

zonecfg:linux:net> set physical=bge0

zonecfg:linux:net> end

zonecfg:linux> add attr

zonecfg:linux:attr> set name="audio"

zonecfg:linux:attr> set type=boolean

zonecfg:linux:attr> set value=true

zonecfg:linux:attr> end

zonecfg:linux> exit

# zoneadm -z linux install -d /mnt/iso/centos_fs_image.tar.bz2
A ZFS file system has been created for this zone.

Installing zone 'linux' at root directory '/home/zones/linux'

from archive '/mnt/iso/centos_fs_image.tar.bz2'


This process may take several minutes.


Setting up the initial lx brand environment.

System configuration modifications complete!

Setting up the initial lx brand environment.

System configuration modifications complete!

Installation of zone 'linux' completed successfully.

Details saved to log file:

"/home/zones/linux/root/var/log/linux.install.10064.log"


#
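
Once the install finishes, booting the zone and getting a shell in it is just the standard zones workflow:

# zoneadm -z linux boot
# zlogin linux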

Thursday, October 05, 2006

How to fork/exec in an efficient way

I've just found a good article on developers.sun.com, "Minimizing Memory Usage for Creating Application Subprocesses" - it explains why the posix_spawn() interface introduced in Solaris 10 is better than the fork/exec approach many developers still use. There's also a good section on memory overcommitting.

SMF management for normal users

Let's say you have your in-house developed applications under SMF. Now you want to give some (non-root) users the ability to restart those applications, but only those applications. You do not want to give them any other privileges. With SMF it's really easy. You can do it per application instance, for an entire group, etc.

1. Add new authorization

# grep ^wp /etc/security/auth_attr
wp.applications:::Manage WP applications::

2. Add new property to each SMF service you want to give access to restart/enable/disable

# svccfg -s wpfileback setprop general/action_authorization = astring: wp.applications

With only that property the user won't be able to change the service status permanently - he or she will be able to restart or temporarily disable/enable the given service (wpfileback in the above example). If you want to give the ability to permanently change the service status, you also need to add:

# svccfg -s wpfileback setprop general/value_authorization = astring: wp.applications

3. Add new authorization to user

# usermod -A wp.applications operator

You can also add the authorization manually by editing /etc/user_attr. After the above command the file contains:
# grep operator /etc/user_attr
operator::::type=normal;auths=wp.applications


Now if you log in as the user operator you will be able to disable/enable/restart the wpfileback application.
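
For example, logged in as operator, managing the service from the example above looks like this (svcadm's -t flag makes the change temporary, i.e. not persistent across reboots):

$ svcadm restart wpfileback
$ svcadm disable -t wpfileback
$ svcadm enable wpfileback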

Additionally, it's useful to give, for example, developers not only the ability to restart their application but also to use DTrace. In order to achieve that, add two privileges to the user:

# grep operator /etc/user_attr
operator::::type=normal;auths=wp.applications;defaultpriv=basic,dtrace_proc,dtrace_user


Now the operator user can not only restart/stop/start the application but also use DTrace to track down problems.

There are also many authorizations and profiles which come with Solaris by default. For example, if you give a user the 'Service Operator' profile, you give them the ability to restart/enable/disable all SMF services.

All of this without giving them the root account.

For more information see smf_security(5), rbac(5), and privileges(5).

Wednesday, October 04, 2006

Thumpers are coming...

or not?

Ehhhh... great technology which you can order and then wait, and wait, and wait, and... nobody can actually tell you for sure how much longer you have to wait. So we're waiting...

Thursday, September 21, 2006

ZFS in High Availability Environments

I see that many people are asking about a ZFS + Sun Cluster solution. Sun Cluster 3.2, which does support ZFS (among many other new features), should be released soon. Solaris 10 is free, Sun Cluster is free. Additionally, installing Sun Cluster is just a few clicks in the GUI installer and voila! Then a few more commands and we have a ZFS pool under Sun Cluster management.
Below is an example (using the new SC 3.2 commands; the old ones are also available for backward compatibility) of how to configure a 2-node HA-NFS cluster with ZFS - as you can see it's really quick & easy.



Nodes: nfs-1 nfs-2
ZFS pool: files

# clresourcegroup create -n nfs-1,nfs-2 -p Pathprefix=/files/conf/ nfs-files
# clreslogicalhostname create -g nfs-files -h nfs-1 nfs-files-net
# clresourcetype register SUNW.HAStoragePlus
# clresource create -g nfs-files -t SUNW.HAStoragePlus -x Zpools=files nfs-files-hastp
# clresourcegroup online -e -m -M nfs-files
# mkdir /files/conf/SUNW.nfs
# vi /files/conf/SUNW.nfs/dfstab.nfs-files-shares
[put nfs shares here related to pool files]
# clresourcetype register SUNW.nfs
# clresource create -g nfs-files -t SUNW.nfs -p Resource_dependencies=nfs-files-hastp nfs-files-shares
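
For example, a simple dfstab entry for this setup could look like this (the path is hypothetical - anything living in the 'files' pool):

share -F nfs -o rw /files/data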


ps. right now it's available as Sun Cluster 3.2 beta - I already have two SC 3.2 beta clusters running with ZFS and I must say it just works - there were some minor problems at the beginning, but the developers from the Sun Cluster team helped so fast that I'm still impressed - thank you guys! Right now it works perfectly.

Wednesday, September 06, 2006

Niagara II internals

An interesting article describing the upcoming Niagara II CPU.

Also read here and here.

Tuesday, September 05, 2006

How much memory does ZFS consume?

When using ZFS, standard tools give inaccurate values for free memory, as ZFS doesn't use the normal page cache and instead allocates kernel memory directly. When a low-memory condition occurs, ZFS should free its buffer memory. So how do you find out how much additional memory is effectively free?


bash-3.00# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 ufs md ip sctp usba fcp fctl lofs zfs random nfs crypto fcip cpc logindmux ptm ipc ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     859062              3355   41%
Anon                       675625              2639   32%
Exec and libs                7994                31    0%
Page cache                  39319               153    2%
Free (cachelist)           110881               433    5%
Free (freelist)            385592              1506   19%

Total                     2078473              8119
Physical                  2049122              8004
>

bash-3.00# echo "::kmastat"|mdb -k|grep zio_buf|awk 'BEGIN {c=0} {c=c+$5} END {print c}'
2923298816


So the kernel consumes about 3.3GB of memory, and about 2.7GB of that is allocated to ZFS buffers and basically should be treated as free memory. The approximate free memory on this host is: Free (cachelist) + Free (freelist) + 2923298816 bytes.

I guess a small script which does all the calculations automatically would be useful.
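
Something like this minimal (untested) sketch should do - it just adds up the freelist and cachelist pages from ::memstat and the ZFS zio_buf caches from ::kmastat:

#!/bin/bash
# approximate "really free" memory on a ZFS host, in bytes
PAGESIZE=`pagesize`
FREE_PAGES=`echo ::memstat | mdb -k | awk '/^Free/ {c+=$3} END {print c}'`
ZIO_BYTES=`echo ::kmastat | mdb -k | grep zio_buf | awk '{c+=$5} END {print c}'`
echo "approx. free memory: $(( FREE_PAGES * PAGESIZE + ZIO_BYTES )) bytes"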

Wednesday, August 23, 2006

Sun gains market share

Sun recoups server market share:
"The Santa Clara, Calif.-based company's server revenue rose 15.5 percent to $1.59 billion in the quarter, according to statistics from research firm IDC. The increase outpaced the overall growth of 0.6 percent to $12.29 billion worldwide, with faster gains in x86 servers, blade servers and lower-end models costing less than $25,000.

Sun's three main rivals fared worse. In contrast, IBM's revenue dropped 2.2 percent to $3.42 billion; Hewlett-Packard's dropped 1.7 percent to $3.4 billion; and Dell's dropped 1.3 percent to $1.27 billion."

Wednesday, August 16, 2006

New servers from Sun

New HW from Sun:

  • US IV+ 1.8GHz
    • available in the v490 and up
    • looks like it beats IBM's latest POWER5+ CPUs
  • X2100 M2 server
    • compared to the standard X2100 it has the latest 1200-series Opterons, DDR2-667, 4x GbE
  • X2200 M2 server
    • 2x 2000-series Opterons (dual-core), 64GB memory supported, 4x GbE, LOM, 2x HDD
  • Ultra 20 M2 workstation
    • compared to the U20 it has the latest Opterons, 2x GbE, DDR2-667, better video

Sun's new servers page.
Official Sun announcement.
Related story.

Tuesday, August 08, 2006

HW RAID vs. ZFS software RAID - part II

This time I tested RAID-5 performance, using the same hardware as in the last RAID-10 benchmark.
I created a RAID-5 volume consisting of 6 disks on a 3510 head unit with 2 controllers, using random optimization. I also created a software RAID-5 (aka RAID-Z) group using ZFS on 6 identical disks in a 3510 JBOD. Both the HW and SW RAIDs were connected to the same host (a v440). The results below are from filebench's varmail test.

These tests show that software RAID-5 in ZFS can not only be as fast as hardware RAID-5, it can even be faster. The same was true for RAID-10 - ZFS software RAID-10 was faster than hardware RAID-10.

Please note that I tested HW RAID on a 3510 FC array, not on some junky PCI RAID card.


1. ZFS on HW RAID5 with 6 disks, atime=off
IO Summary: 444386 ops 7341.7 ops/s, (1129/1130 r/w) 36.1mb/s, 297us cpu/op, 6.6ms latency
IO Summary: 438649 ops 7247.0 ops/s, (1115/1115 r/w) 35.5mb/s, 293us cpu/op, 6.7ms latency

2. ZFS with software RAID-Z with 6 disks, atime=off
IO Summary: 457505 ops 7567.3 ops/s, (1164/1164 r/w) 37.2mb/s, 340us cpu/op, 6.4ms latency
IO Summary: 457767 ops 7567.8 ops/s, (1164/1165 r/w) 36.9mb/s, 340us cpu/op, 6.4ms latency

3. there's some problem in snv_44 with UFS so UFS test is on S10U2 in test #4
4. UFS on HW RAID5 with 6 disks, noatime, S10U2 + patches (the same filesystem mounted as in 3)
IO Summary: 393167 ops 6503.1 ops/s, (1000/1001 r/w) 32.4mb/s, 405us cpu/op, 7.5ms latency
IO Summary: 394525 ops 6521.2 ops/s, (1003/1003 r/w) 32.0mb/s, 407us cpu/op, 7.7ms latency

5. ZFS with software RAID-Z with 6 disks, atime=off, S10U2 + patches (the same disks as in test #2)
IO Summary: 461708 ops 7635.5 ops/s, (1175/1175 r/w) 37.4mb/s, 330us cpu/op, 6.4ms latency
IO Summary: 457649 ops 7562.1 ops/s, (1163/1164 r/w) 37.0mb/s, 328us cpu/op, 6.5ms latency


See my post on zfs-discuss@opensolaris.org list for more details.


I have also found some benchmarks comparing ZFS, UFS, ReiserFS and EXT3 - ZFS was of course the fastest one on the same x86 hardware. See here and here.

DTrace in Mac OS

Thanks to Alan Coopersmith I've just learned that DTrace will be part of Mac OS X Leopard.

Mac OS Leopard Xcode:

Track down problems

When you need a bit more help in debugging, Xcode 3.0 offers an extraordinary new program, Xray. Taking its interface cues from timeline editors such as GarageBand, now you can visualize application performance like nothing you’ve seen before. Add different instruments so you can instantly see the results of code analyzers. Truly track read/write actions, UI events, and CPU load at the same time, so you can more easily determine relationships between them. Many such Xray instruments leverage the open source DTrace, now built into Mac OS X Leopard. Xray. Because it’s 2006.


btw: such a GUI tool would be useful for many Solaris admins too

Monday, August 07, 2006

HW RAID vs. ZFS software RAID

I used a 3510 head unit with 73GB 15K disks, with a RAID-10 made of 12 disks in one enclosure.
On another server (same server specs) I used 3510 JBODs with the same disk models.

I used filebench to generate workloads. "varmail" workload was used for 60s, two runs for each config.


1. ZFS filesystem on HW lun with atime=off:

IO Summary: 499078 ops 8248.0 ops/s, (1269/1269 r/w) 40.6mb/s, 314us cpu/op, 6.0ms latency
IO Summary: 503112 ops 8320.2 ops/s, (1280/1280 r/w) 41.0mb/s, 296us cpu/op, 5.9ms latency

2. UFS filesystem on HW lun with maxcontig=24 and noatime:

IO Summary: 401671 ops 6638.2 ops/s, (1021/1021 r/w) 32.7mb/s, 404us cpu/op, 7.5ms latency
IO Summary: 403194 ops 6664.5 ops/s, (1025/1025 r/w) 32.5mb/s, 406us cpu/op, 7.5ms latency

3. ZFS filesystem with atime=off with ZFS raid-10 using 12 disks from one enclosure:
IO Summary: 558331 ops 9244.1 ops/s, (1422/1422 r/w) 45.2mb/s, 312us cpu/op, 5.2ms latency
IO Summary: 537542 ops 8899.9 ops/s, (1369/1369 r/w) 43.5mb/s, 307us cpu/op, 5.4ms latency


In other tests HW RAID vs. ZFS software RAID showed about the same performance.
So it looks like, at least for some workloads, ZFS software RAID can be faster than HW RAID.
Also, please note that the HW RAID was done on a real HW array and not on some crappy PCI RAID card.

For more details see my post on ZFS discuss list.

Thursday, August 03, 2006

Solaris Internals

Finally, both volumes of the new Solaris Internals are available. It's a must-buy for everyone seriously using Solaris.
See here and here.

Thursday, July 27, 2006

Home made Thumper?

Or rather not? See this blog entry and learn what is so different about Thumper. I can't wait to get one for testing. It could be just a great architecture for NFS servers.

Saturday, July 22, 2006

New workstation from Sun?

Looks like we can expect a new workstation from Sun soon. Look at BugID 6444550: "Next month, Munich workstation will be shipped."

Friday, July 21, 2006

UNIX DAYS - Gdansk 2006

My ZFS presentation and my Open Solaris presentation from the last Unix Days. These presentations are in English. You can also download other presentations from Unix Days there, however some of them are in Polish.

Thursday, July 20, 2006

ZFS would have saved a day

Ehhh... sometimes everything just crashes and then all you can do is wait MANY hours for fsck, then again for fsck... well, ZFS probably would have helped here - or maybe not, as it's a new technology and other problems could have arisen. Anyway, we'll put it to the test, as we're using ZFS more and more, and someday we'll know :) This time the famous 'FSCK YOU' hit me :(

Monday, July 17, 2006

Xen dom0 on Open Solaris

Open Solaris gets Xen dom0 support.
I haven't played with it yet, but it looks like 32/64-bit is supported, MP (up to 32-way) is supported, domU for Open Solaris/Linux, live migration - well, lots of work in a short time. More details at the Open Solaris Xen page. There's also a behind-the-scenes blog entry.

Tuesday, July 11, 2006

X4500 picture

Well, 2x dual-core Opterons + 48x SATA disks in 4U with a list price of about $2.5 per GB.

Thursday, July 06, 2006

New hardware from Sun

At Tuesday's Network Computing event, Sun is going to show new Opteron servers. Some speculation about these servers from The Register.

Wednesday, July 05, 2006

Nexenta Zones

Nexenta got Zones support - well done. I'm really impressed with the progress those guys (and women?) are making.

Monday, July 03, 2006

FMA on x64

Mike wrote:
Last Monday, Sun officially released Solaris 10 6/06, our second update to Solaris 10. Among the many new features are some exciting enhancements to our Solaris Predictive Self-Healing feature set, including:
  • Fault management support for Opteron x64 systems, including CPU, Memory diagnosis and recovery,
  • Fault management support for SNMP traps and a new MIB for browsing fault management results, and
  • Fault management support for ZFS, which also is new in Solaris 10 6/06.

Thursday, June 29, 2006

Production debugging for sys admins

Many people are curious whether SystemTap is ready for production. At the last UNIX DAYS we had a presentation about SystemTap prepared by a friend of ours. During his preparations for the conference he almost got used to several system crashes a day. That alone speaks for itself. On top of that, SystemTap currently lacks user-space tracing and many, many other things. I wouldn't put it in production anytime soon and I don't know anyone who is using it.

James posted a well balanced blog entry about it.

Tuesday, June 27, 2006

Solaris 10 06/06

Finally Solaris 10 06/06 (update 2) is available. Read What's New.

ps. yes, ZFS is included and officially stable :)

Thursday, June 15, 2006

NexentaOS Alpha 5

NexentaOS Alpha 5 is available - just in time to celebrate the Open Solaris one-year anniversary.
You can download it here. I must say that it's truly amazing what the people behind Nexenta are doing. Just one year after Open Solaris hit the streets, they provide an almost fully working GNU distribution based on Open Solaris and Debian. I'm really impressed.
"This release of NexentaOS brings to you fully integrated Ubuntu/Dapper Drake userland. Today NexentaOS APT repository contains more than 11,000 Ubuntu/Dapper packages. This number is constantly growing, driven mostly by our industry-strength AutoBuilder."

In addition, Alpha 5 contains:

  • Sun's Java SE 5.0 Java Development Kit (JDK(tm)) distributed under the new [WWW] Distributor's License for Java. Available via NexentaOS APT repository.

  • Live Upgrade. Starting from Alpha 5 we are supporting out-of-APT upgrade of the OpenSolaris core. Use Debian tools to bring your system up to the bleeding edge..

  • Minimal and full installation, safe mode boot option, removable drive support.

  • OpenOffice.org 2.0, natively compiled on NexentaOS.

  • OpenSolaris build #40, non-DEBUG kernel.

And also:

  • Xorg 7.0.x.

  • GNOME 2.14.x with a bunch of neat features, in particular Application Add/Remove.

  • KDE 3.5.2 and XFCE 4.3 alternative desktop environments.

And on top of that:

  • Samba 3.0 (server and client included with InstallCD), iSCSI, ZFS with the latest fixes and updates.

  • Graphical .deb Package Installer

  • Apache, MySQL, and Perl/Python/PHP

  • Rhythmbox Notification

  • Better Applications Menu Organization

  • Firefox 1.5, Thunderbird 1.5

  • Search Results in Nautilus

  • New Log Out Dialog

  • New polished look and feel

  • And 11,000 more packages, including the most popular office applications, graphics, and multi-media.

Putting your Code into Open Solaris

I'm definitely not a developer, but I can still do some C programming. In order to better understand ZFS I was playing with its sources, like adding my own "compression" to ZFS during Christmas (I know... but it was really late at night and I didn't want to sleep). Later I wanted to implement RFE 6276934, the ability to import destroyed pools, as I think that in some cases this would be very useful. Additionally, while playing with the ZFS sources I already knew that implementing this should be really simple, and I wanted to test how easy (or not) it is in practice to get your code integrated into Open Solaris (and later into Solaris). I signed the Contributor Agreement, made the necessary code changes, tested them, and then made the man page changes. Then I requested a sponsor - Darren Moffat offered his help. An ARC case was needed as new options were added; a code review was also needed, plus some paperwork. Thankfully for me, Darren took care of all of this - thank you. A while later my code changes were integrated (snv_37) into Open Solaris and will also be available in the upcoming Solaris 10 Update 2. You can read more about my changes here.

The point is that it's easy to get your code integrated into Open Solaris, and you don't have to be a developer - if you are, for example, a system admin and you find something annoying (or missing) in Open Solaris, you can easily fix it and share your fix with others. And that's one of the main goals of Open Solaris, isn't it?

Some people are afraid that contributing code to Open Solaris could actually mean worse code - fortunately that's NOT the case, as even if you are not from Sun you have to submit your changes for code review and ARC, follow the Open Solaris coding style, etc., and fortunately for you (the submitters) Sun people will take care of this - you just write the changes. That way the high quality of the code in Open Solaris is preserved.

Here you can find other bug fixes contributed to Open Solaris by non-Sun people. There are quite a lot of them after just one year of Open Solaris being here.

Open Solaris Anniversary

Yesterday, June 14th, was the first anniversary of Open Solaris! What a year - a lot of things happened, and most of them were good. Just a quick glance at the stats shows that Open Solaris is already a success, with more interest than anyone thought there would be. I think that Open Solaris, after just one year, is far ahead of where we all expected it to be, which is very good.

To celebrate the PLOSUG formation and the Open Solaris anniversary, we had the first PLOSUG meeting yesterday.

Tuesday, June 13, 2006

Polish Open Solaris User Group

Hi.

Just after the last Unix Days, Andrzej and I decided to create a Polish Open Solaris User Group - PLOSUG. We were supposed to do it a few weeks ago but... Anyway, here we are. Tomorrow (June 14th) is the one-year anniversary of Open Solaris, so we think it's a good opportunity to get together and celebrate both the anniversary and the creation of PLOSUG.

You can find more info about tomorrow's meeting on the PLOSUG page.

If you are from Poland and want to participate, talk, etc. about Open Solaris and its technologies, and also meet up from time to time, then please join us.

PL-OSUG mailing-list is here.
PL-OSUG archives are here.

ps. this entry is in English, however we'll mostly talk in Polish on the PLOSUG mailing list, I guess.

Friday, June 09, 2006

fsck, strange files, etc.

In some places I use SATA drives for data storage. From time to time there is a problem with the filesystems and I have to run fsck; sometimes files end up a little bit garbled, etc.
Last time I tried ZFS on SATA disks, and to my surprise - after just 3 days I got a few hundred checksum errors - well, that explains a lot. Then it stabilized, and now I see occasional errors from time to time.

Sometimes we have had to live with certain problems for so long, without any alternative, that we have forgotten about the problem and got accustomed to fsck, a few bad files, etc.


I would say that ZFS changes that picture radically.
Thanks to ZFS there's no need to fsck, proper data is returned to applications, no mangled fs entries, etc.
It has already saved our data :)

ps. I haven't yet seen checksum errors reported by ZFS on SCSI, FC or SAS disks...

Wednesday, June 07, 2006

Coolthreads warranty

I can't understand Sun's warranty policy for CoolThreads servers - only 90 days??!??! Their other entry-level servers have at least 1 year - and all their Opteron servers have 4 years by default (even the X2100). So why does the T1000 have only 90 days of warranty? (The T2000 was recently changed to 1 year.)

To make things worse, you can't buy a T1000 with Bronze support - you have to buy at least Silver - but that means quite a cost in the 2nd and 3rd years if you want 3 years of "warranty" - I know you get more, but sometimes you don't need more and all you need is a simple (cheap) warranty.

IMHO it should be corrected as soon as possible so Niagara servers are treated at least the same way as the Opteron servers - 3 years warranty by default. Bronze support should also be offered (at least it's not possible to buy Bronze support for the T1000/T2000 here in Poland).



Sun's warranty matrix for entry level servers.

Thursday, June 01, 2006

RAID-Z not always a good solution

We all know ZFS is great. RAID-Z means fast RAID-5 without a HW RAID controller. However, the devil is in the details - while RAID-Z is great for write speed, data integrity, etc., its read performance can be really bad if you issue many small random reads from many streams and your dataset is big enough that your cache hit ratio is really small. Sometimes the solution is to make a pool with many RAID-Z groups - it means less available storage, but better performance (in terms of IOs/s).

So if you want to use RAID-Z in your environment, first carefully consider your workload, and if many RAID-Z groups in one pool aren't a good solution for you, then use the other RAID levels offered by ZFS. Fortunately, the other RAID levels in ZFS are NOT affected this way.
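
For example, instead of one wide 12-disk RAID-Z group, a pool made of three smaller groups could look like this (disk names are hypothetical):

# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
                    raidz c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
                    raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0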

If you need more details then read Roch's blog entry on it.

Minimizing Memory Usage for Creating Application Subprocesses

Interesting article on fork()/system()/popen()/posix_spawn() and memory overcommit.

Monday, May 15, 2006

T1000 arrived

Another T1000 arrived a few days ago for testing - just after UNIX DAYS I'm going to start testing it.

Wednesday, May 10, 2006

USDT enhancements

Adam Leventhal has added new features to USDT probes in DTrace - the is-enabled probes in particular are a great addition.

PSH + SMF = less downtime

Richard's Ranch has posted some info on Memory Page Retirement in Solaris which is part of Predictive Self Healing. If you want more details you should read: Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults.

What a coincidence, because just one day earlier one of our servers encountered an uncorrectable memory error. Fortunately it happened in user space, so Solaris 10 just cleared that page and killed the affected application, and thanks to SMF the application was automatically restarted. It all happened not only automatically but also quickly enough that our monitoring detected the problem AFTER Solaris had already taken care of it and everything was working properly.

Here is the report in /var/adm/messages about the memory problem.

May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 321281 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000c303b.ed832017
May 8 22:47:03 syrius.poczta.srv AFSR 0x00000000.00200000 AFAR 0x00000001.f0733b38
May 8 22:47:03 syrius.poczta.srv AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xffffffff7e7043c8
May 8 22:47:03 syrius.poczta.srv UDBH 0x00a0 UDBH.ESYND 0xa0 UDBL 0x02fc UDBL.ESYND 0xfc
May 8 22:47:03 syrius.poczta.srv UDBL Syndrome 0xfc Memory Module Board 6 J????
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 714160 kern.info] [AFT2] errID 0x000c303b.ed832017 PA=0x00000001.f0733b38
May 8 22:47:03 syrius.poczta.srv E$tag 0x00000000.18c03e0e E$State: Exclusive E$parity 0x0c
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x2d002d01.2d022d03
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x2d672d68.2d692d6a
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x2d6b2d09.2c912c92
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x2c932d0d.2d0e2d0f
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x2d102d11.2d122d13
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.09040000
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x000006ea.00002090
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x2091001c.2d1d2d1e *Bad* PSYND=0x00ff
May 8 22:47:03 syrius.poczta.srv unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000001.f0732000
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 863414 kern.info] [AFT3] errID 0x000c303b.ed832017 Above Error is in User Mode
May 8 22:47:03 syrius.poczta.srv and is fatal: will SIGKILL process and notify contract
May 8 22:47:20 syrius.poczta.srv unix: [ID 221039 kern.notice] NOTICE: Previously reported error on page 0x00000001.f0732000 cleared


Then, just by using 'svcs', I learned which application was restarted and looked into the application's SMF log file, which contains (XXXXXXXX substituted for the application path):

[ May 8 22:47:03 Stopping because process killed due to uncorrectable hardware error. ]
[ May 8 22:47:03 Executing stop method ("XXXXXXXX stop") ]
[ May 8 22:47:04 Method "stop" exited with status 0 ]
bash: line 1: 22242 Killed LD_PRELOAD=libumem.so.1 XXXXXXXX
[ May 8 22:48:44 Executing start method ("XXXXXXXXX start") ]
[ May 8 22:48:46 Method "start" exited with status 0 ]

UNIX DAYS - Gdansk 2006

UNIX DAYS - Gdansk 2006. This is the second edition of a conference about what's new in UNIX (in production). The conference will take place in Gdansk, Poland on May 18-19th. This time we managed to get Wirtualna Polska, Sun, Symantec, EMC, Implix and the local university involved - thank you.

What is it about and how did it start?
In October 2005 I thought about creating a conference in Poland about new technologies in UNIX, made by sysadmins for sysadmins (or by geeks for geeks). The idea was to present new technologies without any marketing crap - purely technical. So I asked two of my friends to join me and help make it real. Then I asked my company, Sun, Veritas and the local university to sponsor us - and they did. That way UNIX DAYS 2004 was born. The conference took two days and all the speakers were people who were actually using the technologies they talked about in production environments. I must say the conference was very well received. Unfortunately, due to lack of time, there was no UNIX DAYS in 2005.

See you at UNIX DAYS!

update: well, we had to close public registration after just one day - over two hundred people registered in one day and we have no free places left. It really surprised us.

ps. our www pages are only in Polish - sorry about that.

Thursday, April 27, 2006

Hosting on Solaris instead of FreeBSD

Hosting webapps on Solaris instead of FreeBSD? Why not :) Looks like that is exactly what Joyent is going to do. Look at this PDF to see how they consolidate on Niagara boxes, use zones, ZFS, etc.

Monday, April 24, 2006

Software RAID-5 faster than RAID-10

Software RAID-5 solutions have always been much slower at writing data than RAID-10. However, ZFS completely changes that picture. Let's say we've got 4 disks - with all the other software RAID solutions, when you create a RAID-5 from these 4 disks, writing to it will be MUCH slower than if you had created a RAID-10. But if you create a RAID-5 (in ZFS called RAID-Z) from these 4 disks, you will see that write performance is actually much better than with RAID-10.
I did such a test today with 4 disks, writing sequentially to the different RAID levels using ZFS and the 'dd' command.



ZFS RAID level - sequential write throughput:

  • RAID-10 (mirror+stripe): 117 MB/s
  • RAID-Z (RAID-5): 175 MB/s
  • RAID-0 (striping): 233 MB/s


In theory ZFS in this test (writing) should give us the performance of two disks in the case of RAID-10, three disks in the case of RAID-Z, and four disks in the case of RAID-0. And that's exactly what we see above! (117/2 = 58.5000 ; 175/3 = 58.3333 ; 233/4 = 58.2500).
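
A rough sketch of how such a test can be reproduced (disk names are hypothetical; destroy the pool between layouts):

# zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0     (RAID-10)
# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0             (RAID-Z)
# zpool create tank c1t0d0 c1t1d0 c1t2d0 c1t3d0                   (RAID-0)
# dd if=/dev/zero of=/tank/bigfile bs=1024k count=8192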

Saturday, April 22, 2006

MySQL on Solaris faster than on Linux?

Yahoo reports that MySQL on Solaris is faster than on Linux.
SANTA CLARA, Calif., April 21 /PRNewswire-FirstCall/ -- Sun Microsystems, Inc. (Nasdaq: SUNW - News) today announced new benchmark results involving the performance of the open source MySQL database running online transaction processing (OLTP) workload on 8-way Sun Fire(TM) V40z servers. The testing, which measured the performance of both read/write and read-only operations, showed that MySQL 5.0.18 running on the Solaris(TM) 10 Operating System (OS) executed the same functions up to 64 percent faster in read/write mode and up to 91 percent faster in read-only mode than when it ran on the Red Hat Enterprise Linux 4 Advanced Server Edition OS.
"MySQL and Sun have worked to optimize the performance of MySQL Network software certified for the Solaris 10 OS," said Zack Urlocker, vice president of marketing for MySQL AB. "This benchmark demonstrates Sun's continued leadership in delivering a world-class operating environment that provides excellent MySQL performance for our mutual enterprise customers."
I wonder if that means there were some changes to MySQL so it runs faster on Solaris?
In our internal tests MySQL on Linux (with older MySQL versions, however) was actually slightly faster than on Solaris - maybe we should re-check it. Unfortunately, not much info on the actual system tuning and disk config is reported (was ZFS used? how many IOs were generated? etc.).

Actual benchmark is here and official Sun announcement here.

Friday, April 21, 2006

T1000 arrived

The T1000 has just arrived! You can expect some production performance numbers soon :)
As the T1000 uses only two of the four memory controllers in the UltraSPARC T1, it could be slower than a T2000 with the same CPU - I guess it depends on the application - we'll see how it goes here.

Wednesday, April 19, 2006

Predictive Self Healing

Yesterday a system - an E6500 with Solaris 10 - reported memory errors (ECC corrected) a few times, on board 0, DIMM J3300.


Apr 18 17:45:30 server SUNW,UltraSPARC-II: [ID 197328 kern.info] [AFT0] Corrected Memory Error detected by CPU8, errID 0x0002de3b.e98a88fa
Apr 18 17:45:30 server AFSR 0x00000000.00100000 AFAR 0x00000000.c709d428
Apr 18 17:45:30 server AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x11e7fe8
Apr 18 17:45:30 server UDBL Syndrome 0x64 Memory Module Board 0 J3300
Apr 18 17:45:30 server SUNW,UltraSPARC-II: [ID 429688 kern.info] [AFT0] errID 0x0002de3b.e98a88fa Corrected Memory Error on Board 0 J3300 is Intermittent
Apr 18 17:45:30 server SUNW,UltraSPARC-II: [ID 671797 kern.info] [AFT0] errID 0x0002de3b.e98a88fa ECC Data Bit 7 was in error and corrected


As these errors happened a few more times, the system finally removed a single page (8kB) from system memory so the problem will not escalate and possibly kill the system or an application. Well, I can live with 8kB less memory - that's real Predictive Self Healing.


Apr 18 19:14:51 server SUNW,UltraSPARC-II: [ID 566906 kern.warning] WARNING: [AFT0] Most recent 3 soft errors from Memory Module Board 0 J3300 exceed threshold (N=2, T=24h:00m) triggering page retire
Apr 18 19:14:51 server unix: [ID 618185 kern.notice] NOTICE: Scheduling removal of page 0x00000000.c709c000
Apr 18 19:15:50 server unix: [ID 693633 kern.notice] NOTICE: Page 0x00000000.c709c000 removed from service

Tuesday, April 18, 2006

Booting from ZFS

Booting from ZFS was integrated in b37, which is available as SXCR right now. It's not the final solution - it's ugly, but it's a beginning. The real fun will begin when GRUB understands ZFS. You can see instructions on how to boot from ZFS in b37 at Tabriz's weblog.

Monday, April 10, 2006

T2000 scalability

Another benchmark showing T2000 scalability in various microbenchmarks and comparing it to a 12-way (24-core) E2900. Most of the benchmarks are HPC-focused.

BrandZ DVD

SystemNews:

"If you are interested in downloading the SolarisTM Containers for Linux Applications functionality, BrandZ DVD 35 for the SolarisTM Operating System (Solaris OS) for x86 platforms is available now as a free download from the Sun Download Center. Based on SolarisTM Express build 35, this download is a full install of Solaris OS and includes all of the modifications added by the BrandZ project.

BrandZ is a framework that extends the SolarisTM Zones infrastructure to create Branded Zones, which are zones that contain non-native operating environments. Each operating environment is provided by a brand that may be as simple as an environment with the standard Solaris utilities replaced by their GNU equivalents, or as complex as a complete Linux user space. The brand plugs into the BrandZ framework.

The combination of BrandZ and the lx brand, which enables Linux binary applications to run unmodified on Solaris OS, is Solaris Containers for Linux Applications. The lx brand is not a Linux distribution and does not contain any Linux software. The lx brand enables user-level Linux software to run on a machine with a Solaris kernel, and includes the tools necessary to install a CentOS or Red Hat Enterprise Linux distribution inside a zone on a Solaris system.

The lx brand will run on x86/x64 systems booted with either a 32-bit or 64-bit kernel. Regardless of the underlying kernel, only 32-bit Linux applications are able to run.

As part of the OpenSolaris community, BrandZ is a hosted project on this open source web site that contains detailed information on BrandZ and the latest information on build 35."

ps. yes, this is a continuation of the famous Janus project (LAE)
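
For the curious, creating an lx branded zone looks roughly like configuring a regular zone, just with the lx template and an install step that points at a Linux distribution. A rough sketch only - the zone name, paths and install source below are made-up examples; check the BrandZ project docs for the exact options in build 35:

# configure a zone using the lx brand template
zonecfg -z lxzone "create -t SUNWlx; set zonepath=/zones/lxzone; set autoboot=true"
# install a Linux (e.g. CentOS) image into it - the -d source is illustrative only
zoneadm -z lxzone install -d /export/centos_fs_image
zoneadm -z lxzone boot
zlogin lxzone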

Tuesday, April 04, 2006

Solaris Internals & Debugging

Finally, a new book from the authors of the old Solaris Internals book - actually two books. One (over 1000 pages) covers kernel architecture in Open Solaris and Solaris 10, and the other covers Solaris performance and tools (yep, DTrace). This time there are three authors: Richard McDougall, Jim Mauro and Brendan Gregg. Judging from the last book they wrote, these two are already must-have books.

More info here.

Monday, April 03, 2006

CPU caps

A new way of CPU management - CPU caps. It allows you to set limits on applications (projects and zones) as a percentage of a single CPU (although an application can still be spread across all CPUs). Right now this is available as a preview from the Open Solaris Resource Management project.
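
I haven't tried the preview bits yet, but judging from how other resource controls work, I'd expect the cap to be set with prctl/projmod as a project.cpu-cap (or zone.cpu-cap) resource control expressed in percent of a single CPU. A hedged sketch - the control name, project name and values below are assumptions, not something I've verified against the preview:

# cap the (hypothetical) "webapp" project at 1.5 CPUs worth of time
prctl -n project.cpu-cap -v 150 -t privileged -r -i project webapp
# or make it persistent in the project database
projmod -s -K "project.cpu-cap=(privileged,150,deny)" webapp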

Friday, March 24, 2006

Oracle 10g for Solaris x64

Finally, the 64-bit version of Oracle 10g R2 for Solaris on x64 has been released.

Another non-Sun Niagara test

It's good to see more Niagara tests done by people outside Sun - customers. We found Niagara servers to be really good with many workloads, not just web serving (compared both to traditional SPARC servers and to x86/x64 servers). Here you can find a benchmark of the T2000.

So, after a week with the Niagara T2000, I’ve managed to find some time to do some more detailed benchmarks, and the results are very impressive. The T2000 is definitely an impressive piece of equipment, it seems very, very capable, and we may very well end up going with the platform for our mirror server. Bottom line, the T2000 was able to handle over 3 times the number of transactions per-second and about 60% more concurrent downloads than the current ftp.heanet.ie machine can (a dual Itanium with 32Gb of memory) running identical software. Its advantages were even bigger than that again, when compared to a well-specced x86 machine. Not bad!




Friday, March 17, 2006

My patch integrated in Open Solaris

Finally my patch has been integrated into Open Solaris build 37. I must say the procedure for getting a patch integrated into Open Solaris is really easy - just send a request for a sponsor (someone to help you) to the request-sponsor list, someone will offer to be your sponsor, and that's it. Of course it's good manners to first discuss the problem on the related Open Solaris list if it involves new functionality, etc. If it's just a simple bug you can skip this part.

What is the patch I wrote? It adds functionality to ZFS so you can list and import previously destroyed pools. It was really simple but I guess it will be useful. The actual RFE is 6276934. This should be available in Nevada build 37 and it looks like it will make it into Solaris 10 update 2.
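
A quick sketch of how the functionality is used from the CLI, assuming it is exposed as a -D option to zpool import (the pool name below is just an example):

# list destroyed pools whose devices are still intact
zpool import -D
# recover one of them by importing it again
zpool import -D mypool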

Thursday, March 16, 2006

FMA for Opteron

Some time ago I wrote that FMA enhancements for AMD CPUs were integrated into Open Solaris. Thanks to Gavin Maltby, here are some details. Really worth reading.

Wednesday, March 15, 2006

The Rock and new servers

The Register has some rumors about the new Rock processor:


The Rock processor - due out in 2008 - will have four cores or 16 cores, depending on how you slice the product. By that, we mean that Sun has divided the Rock CPU into four, separate cores each with four processing engines. Each core also has four FGUs (floating point/graphics units). Each processing engine will be able to crank two threads giving you - 4 x 4 x 2 - 32 threads per chip.


Sun appears to have a couple flavors of Rock – Pebble and Boulder. Our information on Pebble is pretty thin, although it appears to be the flavor of Rock meant to sit in one-socket servers. Boulder then powers two-socket, four-socket and eight-socket servers. The servers have been code-named "Supernova" and appear impressive indeed. A two-socket box – with 32 cores – will support up to 128 FB-DIMMs. The eight-socket boxes will support a whopping 512 FB-DIMMs. Sun appears to have some fancy shared memory tricks up its sleeve with this kit.

Monday, March 13, 2006

Ubuntu on Niagara

Well, that was fast. Looks like you can actually boot Linux/Ubuntu on Niagara!
Thanks to extraordinary efforts from David Miller, the Ubuntu SPARC team and the entire Linux-on-SPARC community, it should now be possible to test out the complete Ubuntu installer and environment on Niagara machines. As of today, the unofficial community port of Ubuntu to SPARC should be installable on Niagara, and we would love to hear reports of success or failure (and love them more if they come with patches for performance or features :-)).

Thursday, March 02, 2006

2x E6500 on one T2000

In my previous blog entry I wrote that one T2000 (8 cores, 1 GHz) delivers approximately 5-7 times the performance of a single E6500 (12x US-II 400MHz) in our production. Well, to get an even better picture of how it scales with our applications, we created two zones on the same T2000, putting the applications from one E6500 into one zone and the applications from another E6500 (the same config) into the second zone. Then we put these two zones into real production in place of those two E6500s.
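
Setting up the two zones was nothing fancy - one zone per consolidated E6500 workload. A rough sketch of what one of them could look like (zone name, paths, NIC and address are made-up examples, not our actual config):

zonecfg -z e6500a
zonecfg:e6500a> create
zonecfg:e6500a> set zonepath=/zones/e6500a
zonecfg:e6500a> set autoboot=true
zonecfg:e6500a> add net
zonecfg:e6500a:net> set physical=ipge0
zonecfg:e6500a:net> set address=192.168.10.11
zonecfg:e6500a:net> end
zonecfg:e6500a> commit
zonecfg:e6500a> exit
zoneadm -z e6500a install
zoneadm -z e6500a boot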

During peak hours these E6500s are overloaded (most of the time 0% idle CPU, a dozen threads queued for running, some network packet drops, etc. - you get the idea). Well, the T2000 with exactly the same production workload is loaded at about 20% at peak, with no network packet drops and no threads queued. So there's still a lot of headroom.

In order to see how the T2000 handles I/O, I increased some parameters in our applications so data processing was more aggressive - more NFS traffic and more CPU processing - all in production with real data and workload. Well, the T2000 was reading almost 500 Mb/s from NFS servers, writing another 200 Mb/s to NFS servers, and communicating with frontend servers at about 260 Mb/s. And still no network packet drops, no threads queued up, and the server was loaded at about 30% (CPU) at peak. So there's still large headroom. And all of this traffic went over the internal on-board interfaces. When you add up the numbers you get almost 1 Gb/s of real production traffic.

Unfortunately our T2000 has only 16 GB of memory, which was a little bit problematic, and I couldn't push it even more. I wish I had a T2000 with 32 GB of RAM and a 1.2 GHz UltraSPARC T1 - I could try to consolidate even more gear and do more data processing.


ps. well, we're definitely buying more T2000s and putting them in place of E6500s, E4500s, ...

The applications weren't recompiled for the UltraSPARC T1 - we use the same binaries as on the E6500, and the applications were configured exactly the same. The NFS traffic goes to a really large number of small files, with hundreds of threads doing so concurrently and a lot of metadata manipulation (renaming, removing files, creating new ones, etc.) - so it's not simple sequential reading of big files. The on-board GbE NICs were used on the T2000. No special tuning was done specifically for the T2000 - the same tunables as on the E6500s (larger TCP buffers, bigger backlog queues, more NFS client threads per filesystem, etc.). Solaris 10 was used.
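
For the record, the tuning mentioned above is the usual stuff - a sketch with placeholder values, not our actual production settings:

# larger TCP buffers and a longer listen backlog (ndd changes don't survive a reboot)
ndd -set /dev/tcp tcp_xmit_hiwat 400000
ndd -set /dev/tcp tcp_recv_hiwat 400000
ndd -set /dev/tcp tcp_conn_req_max_q 4096
# more NFSv3 client threads per mounted filesystem, in /etc/system (needs a reboot)
set nfs:nfs3_max_threads = 32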

Wednesday, February 22, 2006

T2000 beats E6500

We put a T2000 (8 cores, 1 GHz) into our production in place of an E6500 with 12x US-II 400MHz. We've got heavily multithreaded applications here. The server does quite a lot of small NFS transactions and some basic data processing. We didn't recompile the applications for the T1 - we used the same binaries as on US-II. The applications rarely use the FPU.

The T2000 gives us about 5-7x the performance of the E6500 in that environment.

Well, "quite" good I would say :)


ps. we can probably squeeze even more from the T2000. Right now, for lack of time, we'll stay with 5-7x.

Monday, February 20, 2006

T2000 real web performance


We did real production benchmarks using different servers. The servers were put into production behind load-balancers, then the weights on the load-balancers were changed until we got the highest number of dynamic PHP requests per second. Each server had to sustain that number of requests for some time, and no drops or request queuing were allowed. With static requests the numbers for the Opteron and T2000 were even better, but we are mostly interested in dynamic pages.

T2000 is over 4x faster than IBM dual Xeon 2.8GHz!

Except for the x335, which was running Linux, all the other servers were running Solaris 10. Our web server is developed on the Linux platform so it's best tuned there. After fixing some minor problems the web server was recompiled on Solaris 10 update 1 (both SPARC and x86). No special tuning was done to the application, just basic tuning on Solaris 10 (increased backlog, application in the FX class). The web server was running in Solaris Zones. On the x4100 and T2000 servers two instances of the web server were run due to application scalability problems. On the smaller servers it wasn't needed as the CPU was fully utilized anyway. Minimal I/O was issued to disks (only logs). Putting the application into the FX class helped a little bit.

Perhaps putting the application in the global zone, doing some tuning of Solaris and the application itself, plus tweaking compiler options, could get us even better results.

For more details on the T2000 visit the CoolThreads servers web page.
You can also see the SPECweb2005 results, which include the T2000.

Server configurations:

1. IBM x335, 2x Xeon 2.8GHz (single core, 2 cores total)
2. Sun x2100, 1x Opteron 175 2.2GHz (dual core, 2 cores total)
3. Sun x4100, 2x Opteron 280 2.4GHz (dual core, 4 cores total)
4. Sun T2000, 1x UltraSparc T1 1GHz (8 cores, 8 cores total)
5. Sun T2000, 1x UltraSparc T1 1GHz (6 cores - two cores (8 logical CPUs) were switched off using psradm(1M); see the sketch below)
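
A sketch of the two tricks mentioned above - disabling two T1 cores and moving a process into the FX class. The CPU ids and pid below are examples only; on a T1 each core shows up as four logical CPUs, so taking the last two cores offline means eight ids:

# switch off the last two cores (logical CPUs 24-31 on this box)
psradm -f 24 25 26 27 28 29 30 31
psrinfo | grep off-line
# put a running web server process into the FX scheduling class
priocntl -s -c FX -i pid 1234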

Saturday, February 18, 2006

Linux kernel boots on Niagara

From OSNews:
The Linux kernel has booted on top of the sun4v hypervisor on Sun's new Niagara processor (it's just the kernel, there was no root filesystem).

SX 2/06 is out

Please note that Solaris Express 2/06 is more thoroughly tested than Solaris Express Community Edition. SX 2/06 is based on build 33 of Open Solaris (SX CE is based on b33 right now). There are a lot of changes this time - see Dan Price's What's New.

Friday, February 17, 2006

FMA support for Opteron

While looking at the latest changes to Open Solaris I found interesting integrations in the current changelog:
  • BUG/RFE:6359264 Provide FMA support for AMD64 processors
  • BUG/RFE:6348407 Enable EFI partitions and device ID supports for hotpluggable devices
  • BUG/RFE:6377034 setting physmem in /etc/system does not have desired effect on x86
The first one is the most interesting - I hope someone from Sun will write a blog entry about it with more details.
Quickly looking at some files, I can see that memory scrubbing for x86 was added (or maybe it was already there on x86?). It also looks like page retirement on x86 is implemented.


These should be in Solaris Express build 34.
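
Once you're running a build with these changes, the standard FMA tools should show whether anything has been diagnosed on an Opteron box - a quick sketch using the stock commands:

# list any faults diagnosed by the fault manager
fmadm faulty
# statistics for the loaded diagnosis engines and modules
fmstat
# raw error telemetry (ereports) received by fmd
fmdump -e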