Tuesday, April 24, 2007

HW RAID vs. ZFS software RAID - part III

This time I wanted to test software RAID-10 against hardware RAID-10 on the same hardware. I used an x4100 server with a dual-ported 4Gb QLogic HBA connected directly to an EMC Clariion CX3-40 array (both links, each link connected to a different storage processor). The operating system was Solaris 10U3 + patches. For hardware RAID, 4 RAID-10 groups were created, each made of 10 disks (40 disks in total), and each group was presented as a single LUN. So there were 4 LUNs, two on each storage processor, and a ZFS striped pool was created over all 4 LUNs for better performance. For software RAID, the same disks were used, but the Clariion presented each disk individually - 20 disks from one storage processor and 20 from the other. Then one large RAID-10 ZFS pool was created (using the same disk pairs as with HW RAID). In both cases MPxIO was enabled.
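The two pool layouts can be sketched roughly as follows. This is a hedged illustration only - the device names are hypothetical placeholders, not the actual cXtYdZ names from the test box, and only the first few mirror pairs are shown:

```shell
# HW RAID case: the Clariion exposes 4 RAID-10 LUNs; ZFS simply stripes over them.
zpool create hwpool c2t0d0 c2t1d0 c3t0d0 c3t1d0

# SW RAID case: 40 individual disks; ZFS builds the RAID-10 itself as
# 20 two-way mirrors, each pairing a disk from one SP with one from the other.
zpool create swpool \
    mirror c2t0d0 c3t0d0 \
    mirror c2t1d0 c3t1d0 \
    mirror c2t2d0 c3t2d0
    # ... 17 more mirror pairs
```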

Additionally I included results for x4500 (Thumper) for comparison.

Before I go to the results, some explanation of the system names in the graphs is needed.

x4100 HW - hardware RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4100 SW - software RAID as described above
x4100 SW/Q - software RAID as described above, pci-max-read-request=2048 set in qlc.conf
x4500 - software RAID-10 pool made of 44x 7200 RPM 500GB disks (+2 hot spares, +2 root disks)

As you'll see across all the results, setting pci-max-read-request=2048 boosts performance.
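For reference, that tunable goes into the QLogic driver configuration file (the standard Solaris location is shown below):

```shell
# /kernel/drv/qlc.conf - raise the PCI maximum read request size.
# Takes effect after reloading the driver or rebooting.
pci-max-read-request=2048;
```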
Also notice that doing RAID-10 entirely in software means the host has to write twice as much data to the array as when the array does the RAID-10. If enough disks are used and the array itself is not a bottleneck, we'll saturate the links, meaning we should get about half the streaming write performance with software RAID. In real life, when your application doesn't issue that many writes, this won't be a problem. Of course, we're talking only about writes.
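The back-of-the-envelope arithmetic looks like this. The per-link figure is an assumption for illustration (roughly the usable payload of one 4Gb FC link), not a measured number:

```shell
# Illustrative only: why SW RAID-10 halves streaming write throughput
# once the FC links are the bottleneck.
link_MBs=400                        # assumed usable MB/s of one 4Gb FC link
links=2                             # dual-ported HBA, both links active via MPxIO
total=$(( link_MBs * links ))       # raw host-to-array bandwidth
hw_write=$total                     # HW RAID-10: host sends each block once
sw_write=$(( total / 2 ))           # SW RAID-10: host sends each block twice (mirroring)
echo "HW: ${hw_write}MB/s  SW: ${sw_write}MB/s"
```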

Keep in mind that the workload parameters were chosen so that the actual working set was much larger than the server's installed memory, to minimize file system caching.

We can observe this in the first graph - HW RAID gives about 450MB/s for sequential writes, while software RAID gives about 270MB/s, which is ~60%. Sequential reading, on the other hand, is slightly better with software RAID.

1. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Notice how good the x4500 platform is at sequential reading/writing (locally, of course - you won't be able to push that through the network). Additionally, if you calculate total storage capacity and price, the x4500 is the winner here, without any doubt and by a very long margin.

Now let's see what results we get with a more common mixed workload - lots of files, 32 threads reading, writing, creating, and deleting files, etc.

2. filebench - varmail workload, nfiles=100000, nthreads=32, meandirwidth=10000, meaniosize=16384, run 600
zfs set atime=off, recordsize=128k (default)

ZFS software RAID turned out to be the fastest, by a small margin. This time the x4500 is about 30% slower, which is actually quite impressive (44x 7200 RPM SATA disks vs. 40x 15,000 RPM FC disks).

I haven't used IOzone much in the past, so I thought it might be a good idea to try it.

3. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

Well, the x4500 does too well in the above benchmark - it looks like IOzone isn't issuing as random a workload as one might expect. The software RAID results are also too optimistic. Part of the problem could be that IOzone creates only a few files, but large ones. It behaves more like a database than a file server, which matters especially with ZFS. So let's see what happens if we match ZFS's recordsize to IOzone's record size.

4. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 16k -f /f5-1/test/i1
zfs set atime=off, recordsize=16k

Now we get much better throughput across all tests. It shows how important it is to match ZFS's recordsize to the database record size in database environments. I was expecting writes to give less throughput with software RAID. A bit unexpected is how much better the stride read results are. This time the x4500's results are as expected - the worst performer on reads, and great numbers on writes (this is due to ZFS turning most random writes into sequential writes, which suits the x4500 very well, as we saw in the first graph).
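The shorthand settings above correspond to commands like these (the dataset name is taken from the iozone path used in the tests; treat it as illustrative). Note that recordsize only affects files written after the change:

```shell
# Match the file system's record size to the application's I/O size
# before creating the data files.
zfs set atime=off f5-1/test
zfs set recordsize=16k f5-1/test
```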

Let's see what happens if we increase both the IOzone record size and the ZFS recordsize to 128k.

5. iozone -b iozone-1.wks -R -e -s 2g -t 32 -M -r 128k -f /f5-1/test/i1
zfs set atime=off, recordsize=128k (default)

As expected, it helped a lot. Software RAID-10 was able to practically saturate both FC links.
Increasing the block size was also very good for the x4500's SATA disks, which excel at sequential reading/writing.

In all the IOzone "Random mix" tests we observed the x4500 performing surprisingly well - it suggests there are probably many more random writes than reads, and thanks to ZFS transforming random writes into sequential writes we got such good results. But it's a good result for ZFS - it means that, being much less dependent on seek time for random writes, ZFS can get some great performance characteristics out of SATA disks.

I believe that had I been able to connect the disks directly to the host as a JBOD, entirely eliminating the Clariion's storage processors, I would have gotten even better results, especially in terms of random read/write IOPS. Under heavy load, all those storage processors really do is introduce some additional latency.

In different workloads, especially when you write only from time to time and aren't saturating your disks or the array cache, you can potentially benefit from the large non-volatile caches in an array. But then, with sporadic writes, you probably won't notice anyway. In most environments you won't notice an array's read cache either - you probably have more memory in your server (or it's much cheaper to add memory to a server than to an array), or your active data set is so much bigger than any array cache that it really doesn't matter whether you have the cache or not.

When you think about it, most entry-level and mid-range arrays are just x86 servers inside. A CX3-40 is a 2x 2.8GHz Xeon server...

See also my previous similar tests here, here and here.

So the question is - if you need dedicated storage for a given workload, does it make sense to buy mid-range arrays with storage processors, caches, etc.? Or maybe it's not only cheaper but also better, in terms of raw performance, to buy FC, SAS, or SATA (?) JBODs? I'm serious.

As it turns out, for many workloads HW RAID will, in real life, give you less performance for a higher price... but you get all the other features, right? Well, you get clones and snapshots... but you get them built-in with ZFS, and for most workloads ZFS clones and snapshots will not only give much better performance but won't need dedicated disks. Management is also much easier with ZFS than with arrays - especially when you think about the different software needed to manage arrays from different vendors. With ZFS it's all the same - just give it disks... Then again, ZFS is open source, has already been ported to FreeBSD, is being ported to OS X, and is free. When was the last time you had to call EMC to reconfigure your array, and pay for it too?
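Those built-in snapshots and clones are one-liners in ZFS (dataset names here are hypothetical examples):

```shell
# Instant copy-on-write snapshot - no dedicated disks required.
zfs snapshot f5-1/test@before-upgrade

# Writable clone of that snapshot; shares unchanged blocks with the origin.
zfs clone f5-1/test@before-upgrade f5-1/test-clone
```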

Some people are concerned about having enough bus bandwidth when doing SW RAID. First, it's not an issue for RAID-5, RAID-6, or RAID-0, as you push through about the same volume of data regardless of where the RAID is actually done. Then look at modern x86, x64, or RISC servers, even low-end ones, see how much IO bandwidth they have, and compare it to your actual environment. In most cases you don't have to worry about it. It was a problem 10-15 years ago, not now.

When doing RAID in ZFS you also get end-to-end checksumming and self-healing for all of your data.

Now, there are workloads where HW RAID actually makes sense. For one, RAID-5 under a random read workload will generally work better on the array than RAID-Z. But even then it's worth considering just buying more disks in a JBOD instead of spending all the money on an array.

There are also features like remote synchronous replication, which some arrays offer and which some environments need.

The real issue with ZFS right now is its hot spare support and disk failure recovery. Right now it's barely working, and it's nothing like what you're accustomed to with arrays. The ZFS team is working on it, so I expect it to improve quickly. But right now, if you're afraid of disk failures and can't afford any downtime due to a disk failure, you should go with HW RAID, possibly with ZFS as the file system. In that scenario I also encourage you to expose at least 3 LUNs made of different disks/RAID groups to ZFS and do dynamic striping in ZFS - that way ZFS's metadata will be protected.
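A minimal sketch of that recommendation, with hypothetical LUN names: ZFS keeps multiple copies of its metadata and spreads them across top-level vdevs, so striping over several LUNs from different RAID groups protects the pool's metadata even though the data itself is only as redundant as the underlying HW RAID.

```shell
# Dynamic stripe over 3 LUNs, each backed by a separate HW RAID group
# on the array; ZFS spreads redundant metadata copies across all three.
zpool create tank c2t0d0 c2t1d0 c2t2d0
```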

The other problem is that it's hard to find a good JBOD, especially from tier-1 vendors.
It's even harder to find large JBODs (in terms of number of disks). It would be really nice to be able to buy a SAS/SATA JBOD with the ability to add many expansion units, with 4-8 ports to servers, support in cluster configs, etc. Maybe a JBOD with 2.5" SAS disks packed similarly to the x4500 - that would give enormous IOPS/capacity per 1U...

ps. and remember - even with ZFS there's no one hammer...

HW details:
System : Solaris 10 11/06 (kernel Generic_125101-04)
Server : x4100M2 2xOpteron 2218 2.6GHz dual-core, 16GB RAM, dual ported 4Gb Qlogic HBA
Array : Clariion CX3-40, 73GB 15K 4Gb disks
X4500 : 500GB 7200 RPM SATA disks, 2x Opteron 285 (2.6GHz dual-core), 16GB RAM


Anonymous said...

the images are broken

Kevin Closson said...


Can you please clarify something? In the "Hardware RAID" test I see you had 4 RAID-10 groups, each being presented as a single LUN. You then use a ZFS stripe pool over all 4 LUNs. That sounds to me like a combination of hardware and software RAID, since RAID-10 is both mirrored and striped and ZFS striping is RAID 0. Are you really comparing "Hardware RAID" versus "Software RAID", since both cases have an amount of software RAID?

Also, you don't mention anything about stripe widths anywhere.

Anonymous said...

I agree with the previous poster. You're comparing a 40 disk RAID 10 setup with ZFS to what I guess could be called RAID 10+0 across 4 Clariion RAID 10 sets. I think a more apples-to-apples comparison would have been a 10 disk RAID 10 with ZFS vs. a 10 disk hardware RAID 10 on Clariion. Or go as wide as the Clariion would allow you to with RAID 10 (I don't know what that limit is) and match the same number of disks in your ZFS test.

Your results are interesting, but I'm not sure they say what you're implying as definitively as you're stating.

Anonymous said...

You mention that the workload was large enough to minimize filesystem caching being advantageous at the host. That would also mean you were filling the Clariion's write cache (which would be reasonable for your tests). I'm curious, were you clearing the Clariion's write cache between tests (disable then re-enable)? I ask because if you run one test after the other you can potentially still have a fair amount of data in the array's write cache from the previous test. When your subsequent test fills the remaining cache your performance will drop due to forced flushing. The result could be that whoever is tested first (or after the longest period of inactivity) would see better performance.

penguin said...

Hi Robert,
I'm interested to build a NAS gateway from Solaris machine. I find NetApp FAS is too expensive. I'm thinking to use Sun X2200 for the server (it can have lots of RAM for caching purpose) and connect it to JBOD (taking SATA disks, and using SAS connection to server). As you commented out in the last part of you blog, I can't find a JBOD that will work with Solaris. Dell PowerVault MD-1000 won't work with LSI-SAS3801E. Sun ST2500 only take SAS disk, no SATA support. Do you know any JBOD that will work nicely with Solaris? I also hope to see the multi-path, hot-swap, and hot-spare capability being fixed up soon.

Anonymous said...

I'm in the same boat with Penguin. Finding a good reliable JBOD is on the top of my wishlist.

Mark said...

I would like to see someone build an x86 based raid controller using Solaris and ZFS, connected to some Fibre JBODs, pushing out either iSCSI or FC blocks. Unfortunately, Sun did not include its SCSI Target Emulation (STE) in the recent OpenSolaris storage code contribution.

milek said...

Kevin - yep, in a way it's SW and HW RAID, but this actually provides better performance, because if you create one large RAID-10 on the Clariion you have to assign it to one SP and your numbers drop (both the SP and a single HBA become bottlenecks). So I assigned two LUNs to one SP and the other two to the other SP and then striped over them - the performance numbers in some cases doubled.

Penguin - if you're looking for a NAS server in a non-clustered configuration then check Sun's x4500 server - 4U, 4x Opteron Cores, 32GB RAM, 48x SATA 500GB, 2x dual ported GbE - and it's cheap (Sun gives huge discounts on x4500). I've got plenty of these servers.

Mark - the iSCSI target has been in OpenSolaris for some time and will be in the next Solaris update, planned for July this year.

Anonymous said...

This comment about the configuration, "In case of software raid, the same disks were used but each disk was presented as individual disk by Clarrion - 20 disks from one storage processor and 20 from the other. Then one large RAID-10 ZFS pool was created (the same disk pairs as with HW RAID)" is difficult to understand. I think you used the word "disk" about 14 times too many - what are you referring to - a LUN or a RAID Group? The SW configuration is not clear at all. Also what happens to CPU when ZFS is doing the SW RAID?

And about the comment looking for cheap disk - if you want to use the same disks as NetApp, then look at Xyratex - they build the disk shelves that NetApp and others use, and they are commercially available.

RobP said...

A performance study here is not reason enough for anyone to go buy JBOD & put all of one's confidence into ZFS, which is what you are attempting to convince everyone of.

Let's remember that the SAS/SATA drives you buy at Fry's are not the same disks, interfaces, and firmware used in the high-end vendor arrays. Those arrays have much more resiliency built in, such as armature exercising, read-after-write checking, idle seek, full sweep, and head unload. The disks are also dual-link ported to eliminate loss of connection when a loop fails. And when does ZFS/SVM decide to fail/spare a disk? High-end array vendors do this on soft failures in order to eliminate the major overhead incurred when a hard failure hits (requiring a rebuild from parity). How does ZFS/SVM deal with a double-disk failure? RAID-6?

Also in your study you are assigning ONE LUN to each RAID Group. This is unrealistic. With disk drives reaching 1 TB in size, one would have so much wasted space in a given ZFS filesystem (unless you had a myriad of subdirs) that it would be less costly overall to buy a better disk array that can slice-up the RAID Groups into reasonably sized LUNs so that diverse workloads can leverage the whole capacity. Prefetching, MRU/LRU algorithms in HW-RAID arrays are much more advanced than any filesystem.

If everyone subscribed to this reasoning, we'd be back in the dark ages with one JBOD array for every server.

Here's a better test for you - take your 4 x RAID Groups, make 20 LUNs on each, map them through a FC switch to 10 hosts (Solaris with ZFS), and run dissimilar workloads against each at the same time. Now that is reality!


Wojciech said...

where are all the pictures gone?

"When was last time you had to call for EMC to reconfigure your array and yet pay for it?"
Come on, it's a Clariion - it takes a couple of clicks in Navisphere to get what you want.

Anonymous said...

this is another one-sided, ridiculous performance study. for every one like this, EMC, IBM, HDS, HP - and yes, Sun, who sells LOTS of disk storage arrays, can provide you with a study showing why & how their arrays out-perform ZFS RAID.

nice try, but I'm not throwing out my Clariion or DMX for a JBOD array - this ain't the 1980's!

anne said...

The big advantage of RAID systems is redundancy while maintaining transfer speeds, which is why RAID level 0 is not the most effective. But beware - RAIDs sometimes fail, and recovering your data can turn into a nightmare. If at some point you need to recover data from several disks in a RAID, I recommend the following site: http://www.lineared.com/es/recuperar/raid-discos-duros.htm

Anonymous said...

The DNS-1200 and the DNS-1400 from www.areasys.com have been certified on Solaris and the LSI 3801 using ZFS. The solution is mature. I recommend trying one. Great success so far.

Dual Path I/O

Anonymous said...

RobP, by trying to sound smart, you just showed how clueless you are about ZFS. "RAID-6?" "How does ZFS determine when to fail a disk?" Please.

As for the convoluted "performance / algorithm" argument: Right now, you think that's "hardware" RAID you're using? Mm?