Robert Milkowski's blog

Monday, May 21, 2007

Unix Days - Gdansk 2007

Next week the 3rd edition of the Unix Days conference takes place in Poland. We've just opened public registration. Last year we had to close public registration in less than 24 hours, as there are only slightly more than 200 seats available; since not everyone attends every day and every lecture, we can allow about 300 people to register. We were very happy with the last two editions (and thanks to the questionnaires we know you were too), and we hope this one will be even better. This year we decided to extend the conference to three days, compared to two days last time, which means more presentations to attend. It is also a great opportunity for sysadmins to get to know each other better in real life, especially during the evening party for everyone who attends the conference.

See you there!

UPDATE:

The conference went well - over 260 people attended. You can find the presentations here - some of them are in English.

Wednesday, May 09, 2007

NPort ID Virtualization

Aaron wrote:

What do I do all day at the office? Lately, I've been working on adding NPort ID Virtualization (NPIV) to our Leadville FibreChannel stack.

At a high level, you can think of NPIV as allowing one physical FibreChannel HBA to log in multiple times to the SAN, and so you have many virtual HBAs.
Why is this interesting?

The first thing I thought of when I heard about this is hypervisor applications, like Xen. If you have one world wide name per DOMU (in Xen terminology), you can do the same zoning/lun masking that you've always done per server, but this time it's per DOMU.

Another use is that if your HBA breaks, and you have to replace it, you can use the old WWN on the new HBA, and you won't have to rezone your SAN.

More details on NPIV.

Tuesday, April 24, 2007

Windows on Thumper

Sun Microsystems loaned an X4500 to the Johns Hopkins University Physics Department in Baltimore, Maryland to do Windows-SQLserver performance experiments and to be a public resource for services like SkyServer.org, LifeUnderYourFeet.net, and CasJobs.sdss.org.

This is the fastest Intel/AMD system we have ever benchmarked. The 6+ GB/s memory system (4.5GB/s copy) is very promising.

Read full report.

HW RAID vs. ZFS software RAID - part III

This time I wanted to test software RAID-10 vs. hardware RAID-10 on the same hardware. I used an x4100 server with a dual-ported 4Gb Qlogic HBA directly connected to an EMC Clariion CX3-40 array (both links, each link connected to a different storage processor). The operating system was Solaris 10U3 + patches. In the hardware RAID case, 4x RAID-10 groups were created, each made of 10 disks (40 disks in total), and each group was presented as a single LUN. So there were 4 LUNs, two on one storage processor and two on the other; a ZFS striped pool was then created over all 4 LUNs for better performance. In the software RAID case the same disks were used, but each disk was presented by the Clariion as an individual disk - 20 disks from one storage processor and 20 from the other. Then one large RAID-10 ZFS pool was created (with the same disk pairs as with HW RAID). In both cases MPxIO was also enabled.

Additionally I included results for x4500 (Thumper) for comparison.

Before I go to the results, some explanation of the system names used in the graphs is needed.


x4100 HW   - hardware RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4100 SW   - software RAID as described above
x4100 SW/Q - software RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4500      - software RAID-10 pool made of 44 7200 RPM 500GB disks (+2 hot spares, +2 root disks)

As you'll see across all results, setting pci-max-read-request=2048 helps to boost results.
Also please notice that doing RAID-10 completely in software means that the host has to write twice as much data to the array as when doing RAID-10 on the array. If enough disks are used and the array itself is not a bottleneck, we'll saturate the links, meaning we should get about half the application streaming write performance with software RAID. In real life, when your application doesn't issue that many writes, it won't be a problem. Of course we're talking only about writes.
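The write-doubling argument can be put into numbers. Below is a minimal sketch, assuming each 4Gb FC link carries roughly 400MB/s of payload (a round figure I assume for illustration, not a measurement from these tests) and that the links are the only bottleneck. The function name is hypothetical.

```python
def app_write_ceiling(link_mb_s, links, copies_sent_by_host):
    """Upper bound on application streaming writes once the links are saturated.

    copies_sent_by_host is 1 when the array mirrors internally (HW RAID-10)
    and 2 when the host sends both halves of every mirror (SW RAID-10).
    """
    total_link_bandwidth = link_mb_s * links
    return total_link_bandwidth / copies_sent_by_host


LINK_MB_S = 400  # assumption: usable payload of one 4Gb FC link

hw_ceiling = app_write_ceiling(LINK_MB_S, links=2, copies_sent_by_host=1)
sw_ceiling = app_write_ceiling(LINK_MB_S, links=2, copies_sent_by_host=2)

print(f"HW RAID-10 ceiling: {hw_ceiling:.0f} MB/s")
print(f"SW RAID-10 ceiling: {sw_ceiling:.0f} MB/s")
print(f"SW/HW ratio:        {sw_ceiling / hw_ceiling:.2f}")
```

The ratio of one half holds only if the links really are saturated; if the disks or the array limit throughput first, the gap between the two setups can be smaller.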

Keep in mind that the workload parameters were chosen so that the actual working set was much larger than the server's installed memory, to minimize file system caching.

We can observe it in the first graph - HW RAID gives about 450MB/s for sequential writes and software RAID gives about 270MB/s, which is ~60%. Sequential reading, on the other hand, is a little bit better with software RAID.

1. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Notice how good the x4500 platform is for sequential reading/writing (locally of course, you won't be able to push it through the network). Additionally, if you take total storage capacity and price into account, the x4500 is the winner here by a very long margin.

Now let's see what results we get with a more common mixed workload - lots of files, 32 threads reading, writing, creating, deleting files, etc.

2. filebench - varmail workload, nfiles=100000, nthreads=32, meandirwidth=10000, meaniosize=16384, run 600
zfs set atime=off, recordsize=128k (default)

ZFS software RAID turned out to be the fastest by a small margin. This time the x4500 is about 30% slower, which is actually quite impressive (44x 7200 RPM SATA disks vs. 40x 15000 RPM FC disks).

I haven't used IOzone much in the past so I thought it might be a good idea to try it.

3. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Well, the x4500 is too good in the above benchmark - I mean it looks like IOzone isn't issuing as random a workload as one might have expected. Software RAID results are also too optimistic. Part of the problem could be that IOzone creates only a few files, but large ones. It behaves more like a database than a file server, which is especially important with ZFS. So let's see what happens if we match ZFS's recordsize to the IOzone record size.

4. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=16k

Now we get much better throughput across all tests. It shows how important it is to match ZFS's recordsize to the db record size in database environments. I was expecting writes to give less throughput with software RAID. What is a little unexpected is how much better the results are with stride reads. This time the x4500's results are as expected - the worst performer on reads, and great numbers on writes (this is due to ZFS, which transforms most random writes into sequential writes, which is very good for the x4500 as we've seen in the first graph).
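Why matching the record size matters can be shown with a small calculation. This is a simplified sketch, assuming uncached, record-aligned random reads and ignoring prefetch and compression; the idea is that ZFS reads whole records, so a 16k read from a file stored in 128k records pulls the full 128k from disk. The function name is hypothetical.

```python
def read_amplification(recordsize_kb, io_size_kb):
    """Data read from disk per unit of data requested (aligned, uncached reads)."""
    # Number of whole records needed to satisfy the request (ceiling division).
    records_touched = -(-io_size_kb // recordsize_kb)
    return records_touched * recordsize_kb / io_size_kb


for recordsize_kb, io_size_kb in [(128, 16), (16, 16), (128, 128)]:
    factor = read_amplification(recordsize_kb, io_size_kb)
    print(f"recordsize={recordsize_kb}k, I/O size={io_size_kb}k -> {factor:.0f}x")
```

Under these assumptions the mismatched case (test 3) moves eight times more data per random read than the matched cases (tests 4 and 5), which is consistent with the direction of the results, though the model says nothing about their exact magnitude.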

Let's see what happens if we increase both the IOzone record size and the ZFS record size to 128k.

5. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

It helped a lot, as expected. Software RAID-10 was able to practically saturate both FC links.
Increasing the block size was also very good for the x4500's SATA disks, which excel at sequential reading/writing.

In all IOzone "Random mix" tests we observe the x4500 performing too well - it probably means that there are many more random writes than reads, and because ZFS transforms random writes into sequential writes we got such good results. But it's a good result for ZFS - it means that, thanks to ZFS being much less dependent on seek time for random writes, we can get some great performance characteristics out of SATA disks.

I believe that if I had been able to connect the disks directly to the host as a JBOD, entirely eliminating the Clariion's storage processors, I would have got even better results, especially in terms of random read/write IOPS. Under heavy load all those storage processors are actually doing is introducing some additional latency.

In different workloads, especially when you write only from time to time and you're saturating neither your disks nor the array cache, you can potentially benefit from large non-volatile caches in an array. But then, with sporadic writes you probably won't notice anyway. In most environments you won't notice an array read cache either - you've probably got more memory in your server (or it's much cheaper to add memory to a server than to an array), or your active data set is so much bigger than any array cache that it really doesn't matter whether you have the cache or not.

When you think about it - most entry-level and mid-range arrays are just x86 servers inside. CX3-40 is a 2x 2.8GHz Xeon server...

See also my previous similar tests here, here and here.

So the question is - if you need dedicated storage for a given workload, does it make sense to buy mid-range arrays with storage processors, caches, etc.? Or maybe it's not only cheaper but also better in terms of raw performance to buy FC, SAS, SATA (?) JBODs? I'm serious.

It looks like in many workloads HW RAID will in real life give you less performance for a higher price... but you get all the other features, right? Well, you get clones, snapshots... but you get them built-in with ZFS, and for most workloads ZFS clones and snapshots not only give much better performance but also don't need dedicated disks. Then management is much easier with ZFS than with arrays - especially when you think about the different software needed to manage arrays from different vendors. With ZFS it's all the same - just give it disks... Then ZFS is open source, is already ported to FreeBSD, is being ported to OS X and is free. When was the last time you had to call EMC to reconfigure your array and pay for it on top of that?

Some people are concerned about having enough bus bandwidth when doing SW RAID. First, it's not an issue for RAID-5, RAID-6 and RAID-0, as you have to push through about the same volume of data regardless of where the RAID is actually done. Then look at modern x86, x64 or RISC servers, even low-end ones, see how much IO bandwidth they have and compare it to your actual environment. In most cases you don't have to worry about it. It was a problem 10-15 years ago, not now.
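To see why only mirroring really changes the volume of data crossing the bus, here is a rough sketch of how much data the host sends to the disks per unit of application data when RAID is done in software. It assumes full-stripe writes (no read-modify-write traffic), and the function name and the 10-disk group size are illustrative choices of mine.

```python
def host_write_factor(level, disks):
    """Data sent to disks per unit of application data, for full-stripe writes."""
    if level == "raid0":
        return 1.0                    # no redundancy
    if level == "raid10":
        return 2.0                    # every block is written twice
    if level == "raid5":
        return disks / (disks - 1)    # one parity block per stripe
    if level == "raid6":
        return disks / (disks - 2)    # two parity blocks per stripe
    raise ValueError(f"unknown RAID level: {level}")


for level in ("raid0", "raid5", "raid6", "raid10"):
    print(f"{level:7s} on 10 disks: {host_write_factor(level, 10):.2f}x")
```

With hardware RAID the host always sends 1x and the array adds the redundancy itself, so for wide RAID-5 and RAID-6 groups the software overhead on the bus stays modest, while RAID-10 doubles it.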

When doing RAID in ZFS you also get end-to-end checksumming and self-healing for all of your data.

Now, there are workloads where HW RAID actually makes sense. First, RAID-5 for a random-read workload will generally work better on the array than RAID-Z. But it's still worth considering just buying more disks in a JBOD rather than spending all the money on an array.

There are also some features, like remote synchronous replication, which some arrays offer and which some environments need.

The real issue with ZFS right now is its hot spare support and disk failure recovery. Right now it's barely working and it's nothing like what you are accustomed to in arrays. It's being worked on by the ZFS team so I expect it to improve quickly. But right now, if you are afraid of disk failures and you can't afford any downtime due to a disk failure, you should go with HW RAID, possibly with ZFS as the file system. In such a scenario I also encourage you to expose at least 3 LUNs made of different disks/RAID groups to ZFS and do dynamic striping in ZFS - that way ZFS's metadata will be protected.

The other problem is that it is hard to find good JBODs, especially from tier-1 vendors.
It's even harder to find large JBODs (in terms of # of disks). It would be really nice to be able to buy a SAS/SATA JBOD with the ability to add many expansion units, with 4-8 ports to servers, supported in cluster configs, etc. Maybe a JBOD with 2.5" SAS disks packed similarly to the x4500 - this would give enormous IOPS/capacity per 1U...

ps. and remember - even with ZFS there's no one hammer...

HW details:
System : Solaris 10 11/06 (kernel Generic_125101-04)
Server : x4100M2 2xOpteron 2218 2.6GHz dual-core, 16GB RAM, dual ported 4Gb Qlogic HBA
Array : Clariion CX3-40, 73GB 15K 4Gb disks
X4500 : 500GB 7200K, 2x Opteron 285 (2.6GHz dual-core), 16GB RAM

New SPARC Servers

Sun announced new servers based on a new SPARC CPU designed in co-operation with Fujitsu. The new models are: M4000, M5000, M8000, M9000.

See Richard Elling's blog entry about RAS features of this architecture.
Also see Jonathan Schwartz's post.

Thursday, April 19, 2007

NFS server - file stats

Have you ever wondered which files are most accessed on your nfs server? How well are those files cached? You've got many nfs clients...

We've set up a new nfs server on Solaris 10, an Opteron server, Sun Cluster 3.2, ZFS, etc.
So far only part of the production data is served from it and we see somewhat surprising numbers.


bash-3.00# /usr/local/sbin/nicstat.pl 10 3
[omitting first output]
   Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs   %Util     Sat
03:04:20  nge1    0.07    0.05    1.20    1.20   61.50   46.67    0.00    0.00
03:04:20  nge0    0.07    0.05    1.20    1.20   61.50   46.67    0.00    0.00
03:04:20 e1000g1   71.87    0.13  446.22    1.20  164.92  114.83    0.06    0.00
03:04:20 e1000g0    0.34 10117.91    5.40 7120.07   64.00 1455.15    8.29    0.00
   Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs   %Util     Sat
03:04:30  nge1    0.08    0.06    1.30    1.30   62.77   47.54    0.00    0.00
03:04:30  nge0    0.08    0.06    1.30    1.30   62.77   47.54    0.00    0.00
03:04:30 e1000g1   69.13    0.14  430.27    1.30  164.53  110.92    0.06    0.00
03:04:30 e1000g0    0.43 9827.54    6.79 6914.19   64.29 1455.47    8.05    0.00
bash-3.00#

So we have 9-10MB/s being served.


bash-3.00# iostat -xnz 1
[omitting first output]
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
^C

Well, but we do not touch the disks at all.
'zpool iostat 1' also confirms that.

Now I wonder which files we are actually serving right now.


bash-3.00# ./rfileio.d 10s
Tracing...

Read IOPS, top 20 (count)
/media/d001/a/nfs.wsx                                     logical        101
/media/d001/a/0410_komentarz_walutowy.wmv                 logical        712
/media/d001/a/0410_komentarz_gieldowy.wmv                 logical       3654

Read Bandwidth, top 20 (bytes)
/media/d001/a/nfs.wsx                                     logical     188264
/media/d001/a/0410_komentarz_walutowy.wmv                 logical    1016832
/media/d001/a/0410_komentarz_gieldowy.wmv                 logical   96774144

Total File System miss-rate: 0%
^C

In 10 seconds we read ~95MB, which agrees with the 9-10MB/s that nicstat reported. Everything is read as "logical", which agrees with the idle disks.
And most important - we now know which files are served!
So it's time to tune nfs clients... :)
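The cross-check between the rfileio.d output and nicstat is easy to redo by hand. A small sketch using the byte counts printed above (the dictionary is just my transcription of that output, and MB here means 10^6 bytes):

```python
# Bytes read per file during the 10-second rfileio.d sample shown above.
bytes_read = {
    "/media/d001/a/nfs.wsx": 188264,
    "/media/d001/a/0410_komentarz_walutowy.wmv": 1016832,
    "/media/d001/a/0410_komentarz_gieldowy.wmv": 96774144,
}
interval_s = 10

total_bytes = sum(bytes_read.values())
rate_mb_s = total_bytes / interval_s / 1e6

# Find the busiest file and its share of all bytes served.
top_file = max(bytes_read, key=bytes_read.get)
top_share = bytes_read[top_file] / total_bytes

print(f"total: {total_bytes} bytes in {interval_s}s, about {rate_mb_s:.1f} MB/s")
print(f"busiest file: {top_file} ({top_share:.1%} of all bytes)")
```

The rate lands inside the 9-10MB/s range reported by nicstat, and a single .wmv file accounts for nearly all of the traffic.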

You can find rfileio.d script in the DTraceToolkit (although I modified it slightly).

Now imagine what you can do with such possibilities on busier servers. You don't have to guess which files are served most and how well they cache. Using another script, 'rfsio.d', you can break down statistics by file system. And if you want to customize the scripts you can easily and safely do so, as they are written in DTrace.

Of course all of the above is safe to run in production - that's the most important thing.

Additionally, to put it clearly - I did it on the nfs server, not on nfs clients, so it doesn't matter if your clients are *BSD, Linux, Windows, Solaris, ... as long as your nfs server is running Solaris.

Wednesday, April 18, 2007

New Sun low-end array

Sun StorageTek 2500 series. There's also new SAS HBA.

Wednesday, April 11, 2007

Rock

Friday, April 06, 2007

FreeBSD + ZFS

Pawel Dawidek has been working on a ZFS port to FreeBSD for some time now. He's just announced that ZFS is integrated into FreeBSD. Congratulations on the hard work!

See threads on Open Solaris and FreeBSD lists.

I'm also very glad that Pawel will be one of the presenters at Unix Days '07. He'll be talking about FreeBSD.

Wednesday, March 28, 2007

Latest ZFS add-ons

ZFS boot support was just integrated (x86/x64 platform for now). It will be available in SXCE build 62.
Yes, we'll be able to boot directly from ZFS - that will definitely make life easier - no more hassle with separate partitions and their sizing, snapshots and clones for /, and a much easier live upgrade - those are just some examples. In b62 the installer won't know about ZFS (yet), so some manual fiddling will be required to install the system on ZFS.

Also in b62, gzip compression was integrated into ZFS (in addition to lzjb) thanks to Adam Leventhal. Not only can it save you a lot of space transparently to applications, but in some workloads it can actually speed up disk access (if there's free CPU, disk IO is the bottleneck and the data is a good candidate for compression). We've been using the built-in zfs compression (lzjb) for quite some time on LDAP servers - the on-disk database size was reduced 2x and we've also gained some performance. It would be interesting to try ZFS/gzip.

Ditto block support for data blocks was integrated in b61. It means that we can set a new property on a per-file-system basis (zfs set copies=N fs, N=1 by default) to instruct zfs to write N (1-3) copies of data regardless of the pool's protection. As with ditto blocks for metadata, if your pool has multiple vdevs, each copy will be on a different disk.

ZFS support for iSCSI was integrated in b54. It greatly simplifies exposing ZVOLs via iSCSI in the same way sharenfs simplifies sharing file systems over nfs.

In case you haven't noticed, the 'zpool history' feature was integrated in b51. It stores the history of zfs commands in the pool itself so you can see what was happening.

Of course lots of bug and performance fixes were also integrated recently.

Thursday, March 22, 2007

Temple of the Sun

You can win 5000$ :)

Monday, March 19, 2007

ZFS online replication

During last Christmas I was playing with the ZFS code again and I figured out that adding online replication of ZFS file systems should be quite easy to implement. By online replication I mean a one-to-one relation between two file systems, potentially on different servers, where all modifications made to one file system are asynchronously replicated to the other with a small delay (a few seconds). Additionally, one should be able to snapshot the remote file system independently to get point-in-time copies, and to resume replication from snapshots automatically created on both ends at given intervals. The good thing is that once you're just a few seconds behind, you should get all transactions from memory, so you get a remote copy of your file system without generating any additional IOs on the file system being backed up.

For various reasons I haven't done it myself; instead I asked one of my developers to actually implement such a tool, and here we are :)


bash-3.00# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
solaris         5.13G  11.6G  24.5K  /solaris
solaris/testws  5.13G  11.6G  5.13G  /export/testws/
bash-3.00# zfs create solaris/d100
bash-3.00# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
solaris         5.13G  11.6G  26.5K  /solaris
solaris/d100    24.5K  11.6G  24.5K  /solaris/d100
solaris/testws  5.13G  11.6G  5.13G  /export/testws/
bash-3.00#

Now in another terminal:


bash-3.00# ./zreplicate send solaris/d100 | ./zreplicate receive solaris/d100-copy

Back to original terminal:


bash-3.00# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
solaris            5.13G  11.6G  26.5K  /solaris
solaris/d100       24.5K  11.6G  24.5K  /solaris/d100
solaris/d100-copy  24.5K  11.6G  24.5K  /solaris/d100-copy
solaris/testws     5.13G  11.6G  5.13G  /export/testws/
bash-3.00#
bash-3.00# cp /platform/i86pc/boot_archive /solaris/d100/
bash-3.00# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
solaris            5.15G  11.6G  26.5K  /solaris
solaris/d100       12.0M  11.6G  12.0M  /solaris/d100
solaris/d100-copy  12.0M  11.6G  12.0M  /solaris/d100-copy
solaris/testws     5.13G  11.6G  5.13G  /export/testws/
bash-3.00#
bash-3.00# rm /solaris/d100/boot_archive
bash-3.00# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
solaris            5.14G  11.6G  26.5K  /solaris
solaris/d100       24.5K  11.6G  24.5K  /solaris/d100
solaris/d100-copy  12.0M  11.6G  12.0M  /solaris/d100-copy
solaris/testws     5.13G  11.6G  5.13G  /export/testws/
bash-3.00# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
solaris            5.13G  11.6G  26.5K  /solaris
solaris/d100       24.5K  11.6G  24.5K  /solaris/d100
solaris/d100-copy  24.5K  11.6G  24.5K  /solaris/d100-copy
solaris/testws     5.13G  11.6G  5.13G  /export/testws/
bash-3.00#

bash-3.00# cp /platform/i86pc/boot_archive /solaris/d100/
[stop replication in another terminal]
bash-3.00# zfs mount -a
bash-3.00# digest -a md5 /solaris/d100/boot_archive
33e242158c6eb691d23ce2c522a7bf55
bash-3.00# digest -a md5 /solaris/d100-copy/boot_archive 
33e242158c6eb691d23ce2c522a7bf55
bash-3.00#

Bingo! All modifications to solaris/d100 are automatically replicated to solaris/d100-copy. Of course you can replicate over the network to a remote server using ssh.

There are still some minor problems but generally the tool works as expected.

Once the first phase is implemented we will probably start the second one - implementing a tool to manage replication between servers (automatic replication setup when a new file system is created, replication resume in case of a problem, etc.).

There are other approaches which create snapshots and then incrementally replicate them to the remote side at given intervals. While our approach is very similar, it's more elegant and gives you almost on-line replication. What do you think?
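For comparison, the snapshot-based approach mentioned above could look roughly like the sketch below. It is not how zreplicate works (its internals are not shown here); it only builds the standard zfs snapshot / zfs send -i / zfs receive command lines for one replication round. The function name, snapshot names and host name are hypothetical, and nothing is executed.

```python
def incremental_replication_cmds(src_fs, dst_fs, prev_snap, new_snap, remote_host=None):
    """Build the commands for one round of snapshot-based incremental replication.

    Returns (snapshot, send, receive); the output of send is meant to be
    piped into receive. Assumes prev_snap already exists on both sides.
    """
    snapshot = ["zfs", "snapshot", f"{src_fs}@{new_snap}"]
    # Send only the changes made between the previous and the new snapshot.
    send = ["zfs", "send", "-i", f"{src_fs}@{prev_snap}", f"{src_fs}@{new_snap}"]
    receive = ["zfs", "receive", dst_fs]
    if remote_host is not None:
        # Run the receiving side on another server over ssh.
        receive = ["ssh", remote_host] + receive
    return snapshot, send, receive


snapshot, send, receive = incremental_replication_cmds(
    "solaris/d100", "solaris/d100-copy", "repl-1", "repl-2", remote_host="backuphost"
)
print(" ".join(snapshot))
print(" ".join(send), "|", " ".join(receive))
```

Running such a round every few minutes gives point-in-time copies, but the remote side always lags by up to one interval, which is the gap the online approach tries to close.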