Tuesday, February 19, 2008

2530 Array and ZFS

ZFS by default flushes the disk cache after each transaction completes or after each synchronous operation completes, so it ensures that all data which is supposed to be on physical disk is actually there. This is good. However, if you use ZFS with an external array with a battery-backed cache, some arrays will flush their entire cache to disk every time ZFS sends a SCSI flush command. Because the cache in your array is usually mirrored and battery-backed, you don't want that to happen, as it usually hurts the array's performance badly. Some arrays just ignore these SCSI flush commands, but some do not. The ZFS team is working on the problem so that ZFS sends a different SCSI flush command, one which says to flush the cache only if it's not protected, and so that it does not send SCSI flush commands at all to some arrays.
Currently, if you are hit by the problem, you can configure your array to ignore cache flushes, or you can configure ZFS not to send SCSI flushes at all. Check the ZFS Evil Tuning Guide for more information.

The Sun SAS 2530 array is one of those arrays which will flush its cache when asked to.
Let's write a simple C program which creates a new file with the O_DSYNC flag set, writes 255 bytes, closes the file and then deletes it, repeating this N times. Then let's compare the run times on a ZFS file system on a 2530 array with ZFS set to send cache flushes (the default) and with ZFS set not to send them.



# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush: 0
[default, zfs will send cache flush commands]

# ./filesync-1 /slave/tmp 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 59.041564

[let's dynamically turn off cache flushes by zfs]
# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush: 0 = 0x1

# ./filesync-1 /slave/tmp 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 7.050389
We get over an 8x performance improvement!
With multiple streams we would probably get an even bigger improvement.

To permanently disable ZFS cache flushes, put the following in /etc/system:

set zfs:zfs_nocacheflush = 1

Remember that it will disable cache flushes for ALL ZFS file systems on your system.

Below is the source code for the filesync-1 program. Remember, it's a quick program written in one minute just to make a quick test; it's definitely far from beautiful code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>   /* gethrtime() and hrtime_t (Solaris) */

int main(int argc, char *argv[])
{
    char *path;
    char filename[256];
    int n_req;
    int i, fd;
    hrtime_t stopper_start, stopper_stop;

    if (argc != 3)
    {
        printf("Usage: %s path N\n", argv[0]);
        exit(1);
    }

    path = argv[1];
    n_req = atoi(argv[2]);

    /* Build the file name once; zero the buffer first because its
       first 255 bytes are also used as the write payload below. */
    memset(filename, 0, sizeof(filename));
    if (snprintf(filename, sizeof(filename), "%s/filesync_test", path) >= (int)sizeof(filename))
    {
        fprintf(stderr, "path too long\n");
        exit(1);
    }

    stopper_start = gethrtime();
    for (i = 0; i < n_req; i++)
    {
        /* O_CREAT requires a mode argument. */
        fd = open(filename, O_CREAT|O_RDWR|O_DSYNC, 0644);
        if (fd < 0)
        {
            perror("open");
            exit(1);
        }
        /* The content does not matter; each write is synchronous. */
        if (write(fd, filename, 255) != 255)
            perror("write");
        close(fd);
        unlink(filename);
    }
    stopper_stop = gethrtime();

    /* gethrtime() returns nanoseconds. */
    printf("Time in seconds to create and unlink %d files with O_DSYNC: %f\n\n", n_req, (double)(stopper_stop - stopper_start)/1000000000.0);

    exit(0);
}
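
gethrtime() is Solaris-specific, so if you want to repeat the same test on another POSIX system, the loop can be timed with clock_gettime() instead. Below is a minimal sketch of that, not the program used for the numbers above. The function name time_dsync_creates is made up for this example, and it assumes the platform provides O_DSYNC and CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: create a file in dir with O_DSYNC, write 255 bytes,
 * close it and unlink it, n times in a row.
 * Returns the elapsed time in seconds, or -1.0 on the first error. */
double time_dsync_creates(const char *dir, int n)
{
    char filename[1024];
    char buf[255];
    struct timespec start, stop;
    int i, len;

    len = snprintf(filename, sizeof(filename), "%s/filesync_test", dir);
    if (len < 0 || len >= (int)sizeof(filename))
        return -1.0;                      /* path does not fit */
    memset(buf, 'x', sizeof(buf));        /* payload content is irrelevant */

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (i = 0; i < n; i++) {
        int fd = open(filename, O_CREAT | O_RDWR | O_DSYNC, 0644);
        if (fd < 0)
            return -1.0;
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            close(fd);
            unlink(filename);
            return -1.0;
        }
        close(fd);
        unlink(filename);
    }

    if (clock_gettime(CLOCK_MONOTONIC, &stop) != 0)
        return -1.0;

    /* Combine the seconds and nanoseconds parts into one value. */
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}
```

To reproduce the runs above you would call time_dsync_creates("/slave/tmp", 10000) from a small main() and print the returned value, once with zfs_nocacheflush set to 0 and once with it set to 1.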

4 comments:

Anonymous said...

Nice entry in your blog. Unfortunately the #include parentheses happened to decode the wrong way, so we can't see what is "included" without looking at the HTML source :)

milek said...

Thanks. Corrected.

Michel said...

Thanks for the article.
Speaking of 2530 and ZFS, would you recommend ZFS handle mirroring or do it through the 2530 hardware?
ZFS recommends direct access to the drives, the 2530 recommends using internal raid hardware...

FooDog said...

tested on Sun StorageTek 6540 Array.
resulting times were the same.