Robert Milkowski's blog: April 2007

This time I wanted to test software RAID-10 vs. hardware RAID-10 on the same hardware. I used an x4100 server with a dual-ported 4Gb Qlogic HBA directly connected to an EMC Clariion CX3-40 array (both links, each connected to a different storage processor). The operating system was Solaris 10U3 + patches. For hardware RAID, 4x RAID-10 groups were created, each made of 10 disks (40 disks in total), and each group was presented as a single LUN. So there were 4 LUNs, two on one storage processor and two on the other; a striped ZFS pool was then created over all 4 LUNs for better performance. For software RAID, the same disks were used but each disk was presented individually by the Clariion: 20 disks from one storage processor and 20 from the other. Then one large RAID-10 ZFS pool was created (with the same disk pairs as with HW RAID). In both cases MPxIO was also enabled.

Additionally, I included results for the x4500 (Thumper) for comparison.

Before I go to the results, some explanation of the system names used in the graphs is needed.


x4100 HW   - hardware RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4100 SW   - software RAID as described above
x4100 SW/Q - software RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4500      - software RAID-10 pool made of 44 7200 RPM 500GB disks (+2 hot spares, +2 root disks)

As you'll see across all results, setting pci-max-read-request=2048 helps to boost performance.
Also please note that doing RAID-10 completely in software means the host has to write twice as much data to the array as when RAID-10 is done on the array. If enough disks are used and the array itself is not a bottleneck, we'll saturate the links, so we should get about half the application streaming write performance with software RAID. In real life, when your application doesn't issue that many writes, it won't be a problem. Of course we're talking only about writes.

Keep in mind that the workload parameters were chosen so that the actual workload was much larger than the server's installed memory, to minimize file system caching.

We can observe this in the first graph: HW RAID gives about 450MB/s for sequential writes and software RAID gives about 270MB/s, which is ~60%. Sequential reading, on the other hand, is a little bit better with software RAID.
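
The arithmetic behind that observation can be sketched in a few lines. This is only an illustration: the two throughput figures are the approximate values quoted above, the function name is made up for this example, and the model assumes every application byte crosses the host links exactly `copies` times when mirroring is done on the host.

```python
def link_traffic_mb_s(app_rate_mb_s: float, copies: int = 2) -> float:
    """Host-to-array traffic when the host itself writes `copies` mirrors.

    Simplified model (an assumption): ignores metadata and protocol overhead.
    """
    return app_rate_mb_s * copies


hw_write = 450.0  # MB/s, HW RAID-10 sequential write (approximate, from the text)
sw_write = 270.0  # MB/s, SW RAID-10 sequential write (approximate, from the text)

ratio = sw_write / hw_write
sw_link = link_traffic_mb_s(sw_write)

print(f"SW/HW application write ratio: {ratio:.0%}")
print(f"Host-to-array traffic with SW RAID-10: {sw_link:.0f} MB/s")
```

Under this model the host was pushing about 540MB/s over the two links in the software RAID case, which is more than the 450MB/s it sent in the hardware RAID case even though the application saw less.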

1. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Notice how good the x4500 platform is for sequential reading/writing (locally of course, you won't be able to push it through the network). Additionally, if you calculate total storage capacity and price, the x4500 is the winner here by a very wide margin.

Now let's see what results we get with a more common mixed workload - lots of files, 32 threads reading, writing, creating, deleting files, etc.

2. filebench - varmail workload, nfiles=100000, nthreads=32, meandirwidth=10000, meaniosize=16384, run 600
zfs set atime=off, recordsize=128k (default)

ZFS software RAID turned out to be the fastest by a small margin. This time the x4500 is about 30% slower, which is actually quite impressive (44x 7200 RPM SATA disks vs. 40x 15000 RPM FC disks).

I haven't used IOzone much in the past, so I thought it might be a good idea to try it.

3. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Well, the x4500 looks too good in the above benchmark - I mean it looks like IOzone isn't issuing as random a workload as one might have expected. The software RAID results are also too optimistic. Part of the problem could be that IOzone creates only a few files, but large ones. It behaves more like a database than a file server. That is especially important with ZFS. So let's see what happens if we match ZFS's recordsize to IOzone's record size.

4. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=16k

Now we get much better throughput across all tests. It shows how important it is to match ZFS's recordsize to the db record size in database environments. I was expecting writes to give less throughput with software RAID. A little bit unexpected is how much better the results are with stride reads. This time the x4500's results are as expected - worst performer on reads, and great numbers on writes (this is due to ZFS, which transforms most random writes into sequential writes, which is very good for the x4500 as we've seen in the first graph).

Let's see what happens if we increase both the IOzone record size and the ZFS record size to 128k.
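
Why the recordsize matters so much for small random I/O can be illustrated with a deliberately simplified model. It assumes that touching any part of a record means reading (and, for an update, rewriting) the whole record, and it ignores caching, compression and write aggregation. The function name is hypothetical and exists only for this sketch.

```python
def record_amplification(io_size_kb: int, recordsize_kb: int) -> float:
    """Data moved per byte requested when an I/O is smaller than the record.

    Simplified assumption: a partial-record access touches the full record.
    """
    if io_size_kb <= 0 or recordsize_kb <= 0:
        raise ValueError("sizes must be positive")
    if io_size_kb >= recordsize_kb:
        return 1.0
    return recordsize_kb / io_size_kb


# Test 3 above: 16k IOzone records on the default 128k recordsize
mismatched = record_amplification(16, 128)
# Test 4 above: 16k IOzone records on recordsize=16k
matched = record_amplification(16, 16)

print(f"16k I/O on 128k records: {mismatched:.0f}x data moved")
print(f"16k I/O on 16k records:  {matched:.0f}x data moved")
```

In this model a 16k random access against 128k records moves eight times the requested data, while the matched case moves exactly what was asked for.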

5. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

It helped a lot, as expected. Software RAID-10 was able to practically saturate both FC links.
Increasing the block size was also very good for the x4500's SATA disks, which excel at sequential reading/writing.

In all IOzone "Random mix" tests we observe the x4500 performing too well - it probably means there are many more random writes than reads, and because ZFS transforms random writes into sequential writes we got such good results. But it's a good result for ZFS - it means that, thanks to ZFS being much less dependent on seek time for random writes, we can get some great performance characteristics out of SATA disks.

I believe that if I had been able to connect the disks directly to the host as a JBOD, entirely eliminating the Clariion's storage processors, I would have got even better results, especially in terms of random read/write IOPS. Under heavy load, all those storage processors actually do is introduce some additional latency.

In different workloads, especially when you write only from time to time and you're saturating neither your disks nor the array cache, you can potentially benefit from large non-volatile caches in an array. But then, with sporadic writes you probably won't notice anyway. In most environments you won't notice an array read cache either - you've probably got more memory in your server (or it's much cheaper to add memory to a server than to an array), or your active data set is so much bigger than any array cache that it really doesn't matter if you have the cache or not.

When you think about it - most entry-level and mid-range arrays are just x86 servers inside. CX3-40 is a 2x 2.8GHz Xeon server...

See also my previous similar tests here, here and here.

So the question is - if you need dedicated storage for a given workload, does it make sense to buy mid-range arrays with storage processors, caches, etc.? Or maybe it's not only cheaper but also better in terms of raw performance to buy FC, SAS, SATA (?) JBODs? I'm serious.

It looks like in many workloads HW RAID will in real life give you less performance for a higher price... but you get all the other features, right? Well, you get clones, snapshots... but you get them built in with ZFS, and for most workloads ZFS clones and snapshots not only give much better performance but also don't need dedicated disks. Then management is much easier with ZFS than with arrays - especially when you think about the different software needed to manage arrays from different vendors. With ZFS it's all the same - just give it disks... Then ZFS is open source, is already ported to FreeBSD, is being ported to OS X and is free. When was the last time you had to call EMC to reconfigure your array and pay for it as well?

Some people are concerned about having enough bus bandwidth when doing SW RAID. First, it's not an issue for RAID-5, RAID-6 and RAID-0, as you have to push through about the same volume of data regardless of where the RAID is actually done. Then look at modern x86, x64 or RISC servers, even low-end ones, see how much IO bandwidth they have, and compare it to your actual environment. In most cases you don't have to worry about it. It was a problem 10-15 years ago, not now.
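
To put rough numbers on "about the same volume of data", here is a small sketch of how many bytes the host has to push per application byte when the RAID is done in software. It assumes full-stripe writes only (no read-modify-write of partial stripes) and ignores metadata; the function name and the 10-disk group size are illustrative choices, not something measured in these tests.

```python
def host_write_multiplier(level: str, disks: int) -> float:
    """Bytes sent to the disks per application byte, for full-stripe writes.

    Assumption: parity/mirror data is the only overhead.
    """
    minimum = {"raid0": 1, "raid10": 2, "raid5": 3, "raid6": 4}
    if level not in minimum:
        raise ValueError(f"unknown RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"{level} needs at least {minimum[level]} disks")
    if level == "raid0":
        return 1.0                    # no redundancy written
    if level == "raid10":
        return 2.0                    # every block written twice
    if level == "raid5":
        return disks / (disks - 1)    # one parity block per stripe
    return disks / (disks - 2)        # raid6: two parity blocks per stripe


for level in ("raid0", "raid5", "raid6", "raid10"):
    print(f"{level:7s} {host_write_multiplier(level, 10):.2f}x")
```

With 10-disk groups the parity levels add only about 11% to 25% to the traffic, whereas mirroring on the host doubles it, which is why RAID-10 is the case worth checking against your link and bus bandwidth.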

When doing RAID in ZFS you also get end-to-end checksumming and self-healing for all of your data.

Now, there are workloads where HW RAID actually makes sense. First, RAID-5 for random-read workloads will generally work better on the array than RAID-Z. But it's still worth considering just buying more disks in a JBOD rather than spending all the money on an array.

There are also some features, like remote synchronous replication, which some arrays do offer and which some environments need.

The real issue with ZFS right now is its hot spare support and disk failure recovery. Right now it's barely working and it's nothing like what you are accustomed to in arrays. The ZFS team is working on it, so I expect it to improve quickly. But right now, if you are afraid of disk failures and you can't afford any downtime due to a disk failure, you should go with HW RAID, possibly with ZFS as the file system. In such a scenario I also encourage you to expose to ZFS at least 3 LUNs made of different disks/RAID groups and do dynamic striping in ZFS - that way ZFS's metadata will be protected.

The other problem is that it is hard to find good JBODs, especially from tier-1 vendors.
It's even harder to find large JBODs (in terms of number of disks). It would be really nice to be able to buy a SAS/SATA JBOD with the ability to add many expansion units, with 4-8 ports to servers, supported in cluster configs, etc. Maybe a JBOD with 2.5" SAS disks packed similarly to the x4500 - this would give enormous IOPS and capacity per 1U...

ps. and remember - even with ZFS there's no one hammer...

HW details:
System : Solaris 10 11/06 (kernel Generic_125101-04)
Server : x4100M2 2xOpteron 2218 2.6GHz dual-core, 16GB RAM, dual ported 4Gb Qlogic HBA
Array : Clariion CX3-40, 73GB 15K 4Gb disks
X4500 : 500GB 7200K, 2x Opteron 285 (2.6GHz dual-core), 16GB RAM

Have you ever wondered which files are accessed most on your nfs server? How well are those files cached? You've got many nfs clients...

We've put a new nfs server into production on Solaris 10, an Opteron server, Sun Cluster 3.2, ZFS, etc.
So far only part of the production data is served from it, and we see somewhat surprising numbers.


bash-3.00# /usr/local/sbin/nicstat.pl 10 3
[omitting first output]
   Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs   %Util     Sat
03:04:20  nge1    0.07    0.05    1.20    1.20   61.50   46.67    0.00    0.00
03:04:20  nge0    0.07    0.05    1.20    1.20   61.50   46.67    0.00    0.00
03:04:20 e1000g1   71.87    0.13  446.22    1.20  164.92  114.83    0.06    0.00
03:04:20 e1000g0    0.34 10117.91    5.40 7120.07   64.00 1455.15    8.29    0.00
   Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs   %Util     Sat
03:04:30  nge1    0.08    0.06    1.30    1.30   62.77   47.54    0.00    0.00
03:04:30  nge0    0.08    0.06    1.30    1.30   62.77   47.54    0.00    0.00
03:04:30 e1000g1   69.13    0.14  430.27    1.30  164.53  110.92    0.06    0.00
03:04:30 e1000g0    0.43 9827.54    6.79 6914.19   64.29 1455.47    8.05    0.00
bash-3.00#

So we have 9-10MB/s being served.


bash-3.00# iostat -xnz 1
[omitting first output]
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                   extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
^C

Well, but we do not touch the disks at all.
'zpool iostat 1' also confirms that.

Now I wonder which files we are actually serving right now.


bash-3.00# ./rfileio.d 10s
Tracing...

Read IOPS, top 20 (count)
/media/d001/a/nfs.wsx                                     logical        101
/media/d001/a/0410_komentarz_walutowy.wmv                 logical        712
/media/d001/a/0410_komentarz_gieldowy.wmv                 logical       3654

Read Bandwidth, top 20 (bytes)
/media/d001/a/nfs.wsx                                     logical     188264
/media/d001/a/0410_komentarz_walutowy.wmv                 logical    1016832
/media/d001/a/0410_komentarz_gieldowy.wmv                 logical   96774144

Total File System miss-rate: 0%
^C

In 10 seconds we read ~95MB, so it agrees with the 9-10MB/s that nicstat reported. Everything is read as "logical", which agrees with iostat showing no disk activity.
And most important - we now know which files are served!
So it's time to tune the nfs clients... :)
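
As a quick cross-check of those figures, the byte counts from the rfileio.d output and the e1000g0 wKb/s values from nicstat can be compared directly. This sketch assumes nicstat's "Kb" means 1024 bytes and reports everything in units of 2^20 bytes; the two tools were also run at slightly different moments and nicstat includes protocol overhead, so only rough agreement should be expected.

```python
# Read bandwidth per file over the 10-second trace (bytes, from rfileio.d above)
read_bytes = {
    "/media/d001/a/nfs.wsx": 188264,
    "/media/d001/a/0410_komentarz_walutowy.wmv": 1016832,
    "/media/d001/a/0410_komentarz_gieldowy.wmv": 96774144,
}
interval_s = 10

# e1000g0 wKb/s samples from nicstat above (assumed to be units of 1024 bytes)
nicstat_wkb_s = [10117.91, 9827.54]

MIB = 1024 * 1024
total_bytes = sum(read_bytes.values())
dtrace_rate = total_bytes / interval_s / MIB
nicstat_rates = [kb / 1024 for kb in nicstat_wkb_s]

print(f"DTrace total: {total_bytes / MIB:.1f} MB in {interval_s}s = {dtrace_rate:.2f} MB/s")
for rate in nicstat_rates:
    print(f"nicstat e1000g0: {rate:.2f} MB/s")

# Share of the traffic caused by the single hottest file
top_file = max(read_bytes, key=read_bytes.get)
top_share = read_bytes[top_file] / total_bytes
print(f"{top_file}: {top_share:.1%} of bytes read")
```

Both views land between 9 and 10MB/s, and almost all of it comes from a single .wmv file.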

You can find the rfileio.d script in the DTraceToolkit (although I modified it slightly).

Now imagine what you can do with such possibilities on busier servers. You don't have to guess which files are served most and how well they cache. Using another script, 'rfsio.d', you can break down the statistics by file system. And if you want to customize them you can easily and safely do so, as those scripts are written in DTrace.

Of course all of the above is safe to run in production - that's the most important thing.

Additionally, to put it clearly - I did it on the nfs server, not on the nfs clients, so it doesn't matter if your clients are *BSD, Linux, Windows, Solaris, ... as long as your nfs server is running Solaris.

Tuesday, April 24, 2007

Windows on Thumper

HW RAID vs. ZFS software RAID - part III

New SPARC Servers

Thursday, April 19, 2007

NFS server - file stats

Wednesday, April 18, 2007

New Sun low-end array

Wednesday, April 11, 2007

Rock

Friday, April 06, 2007

FreeBSD + ZFS