Tuesday, May 04, 2010

ZFS - synchronous vs. asynchronous IO

Sometimes it is very useful to be able to disable the synchronous behavior of a filesystem. Unfortunately, not all applications provide such functionality. With UFS many people used fastfs from time to time; the problem is that it can potentially lead to filesystem corruption. In the case of ZFS, many people have been using the undocumented zil_disable tunable. While it can cause data corruption from an application point of view, it doesn't impact ZFS on-disk consistency. This makes the feature very useful: the risk is much smaller, yet it can greatly improve performance in some cases such as database imports, NFS servers, etc. The problem with the tunable is that it is unsupported, has a server-wide impact, and affects only newly mounted ZFS filesystems, while it has an instant effect on zvols.

From time to time there were requests to get it implemented properly, in a fully supported way. I thought it might be a good opportunity to refresh my understanding of OpenSolaris and ZFS internals, so a couple of months ago I decided to implement it under: 6280630 zil synchronicity.
And it was fun, I really enjoyed it. I spent most of the time trying to understand the interactions between the ZIL/VNODE/VFS layers and the structure of the ZFS code. I was already familiar with it to some extent, as I have contributed code to ZFS in the past and I also read the code from time to time when I do some performance tuning, etc. Once I understood what's going on there, it was really easy to do the actual coding. Once I got the basic functionality working, I asked for a sponsor so it could get integrated. Tim Haley offered to sponsor me and help me get it integrated. A couple of months later, after a PSARC case, code reviews, email exchanges and testing, it finally got integrated and should appear in build 140.

I would like to thank Tim Haley, Mark Musante and Neil Perrin for all their comments, code reviews, testing, PSARC case handling, etc. It was a real pleasure to work with you.


PSARC/2010/108 zil synchronicity

ZFS datasets now have a new 'sync' property to control synchronous behavior.
The zil_disable tunable to turn synchronous requests into asynchronous requests (disable the ZIL) has been removed. On systems that still set that tunable, after an upgrade you will now see a message on booting:

sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property.
Here is a summary of the property:

-------

The options and semantics for the zfs sync property:

sync=standard
This is the default option. Synchronous file system transactions
(fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
and then secondly all devices written are flushed to ensure
the data is stable (not cached by device controllers).

sync=always
For the ultra-cautious, every file system transaction is
written and flushed to stable storage by a system call return.
This obviously has a big performance penalty.

sync=disabled
Synchronous requests are disabled. File system transactions
only commit to stable storage on the next DMU transaction group
commit which can be many seconds. This option gives the
highest performance. However, it is very dangerous as ZFS
is ignoring the synchronous transaction demands of
applications such as databases or NFS.
Setting sync=disabled on the currently active root or /var
file system may result in out-of-spec behavior, application data
loss and increased vulnerability to replay attacks.
This option does *NOT* affect ZFS on-disk consistency.
Administrators should only use this when these risks are understood.
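
The synchronous requests named in the summary above (fsync, O_DSYNC, O_SYNC) come from applications. As a rough illustration, and not part of the PSARC text, here is a minimal Python sketch of the two usual ways an application asks for them. The function names are made up for this example, and the fallback from O_DSYNC to O_SYNC is an assumption for platforms where Python does not expose O_DSYNC.

```python
import os


def _write_all(fd, data):
    """Write every byte of data, looping over short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_with_fsync(path, data):
    """Ordinary write, then an explicit fsync() before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # the synchronous request the filesystem has to honor
    finally:
        os.close(fd)


def write_with_o_dsync(path, data):
    """Open with O_DSYNC so that each write() is itself synchronous."""
    # Assumption: fall back to O_SYNC where os.O_DSYNC is not available.
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag
    fd = os.open(path, flags, 0o644)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)
```

Going by the summary above, with sync=standard both functions should return only after the data has been written to the intent log and the devices flushed, with sync=disabled they return without waiting for stable storage, and with sync=always even a plain write without fsync is treated as synchronous.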

The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command. For example:

# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin
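
Since the property takes effect immediately, one simple way to see what a given setting does is to time a burst of fsync'd writes on the dataset before and after changing it. The following is only a sketch under my own assumptions: the function name, the write count and the 4 KB write size are arbitrary choices, and it measures nothing more than the average latency of write plus fsync.

```python
import os
import tempfile
import time


def time_fsync_writes(directory, count=100, size=4096):
    """Return the average seconds per write()+fsync() pair in directory."""
    payload = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # a synchronous request on every iteration
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file
    return elapsed / count
```

For example, assuming the dataset whirlpool/milek is mounted at /whirlpool/milek, calling time_fsync_writes("/whirlpool/milek") once with sync=standard and once with sync=disabled should give a rough idea of how much the synchronous semantics cost on that particular system.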


Have fun!

10 comments:

Anonymous said...

That's nice. Would it be feasible to add a fourth option that is like disabled, but which queues an immediate (async) transaction commit, rather than wait for the existing heuristics to kick in and fire one?

milek said...

if it would be an async request it wouldn't make any difference, would it?

Anonymous said...

It would reduce the window during which the real data is not persisted to 'as soon as the disk can manage'. It's not going to give you a guarantee (that's a big strong word!) but it would be a bit more predictable.

milek said...

if you need it then don't use sync=disabled and you do have full control of what needs to be synchronous and what doesn't.

Anonymous said...

How about sync=onclose
Every time a file open for writing is closed it syncs, i.e. equivalent to fsync() followed by close(), for a particular file system.
OR
sync=closewait
close() waits for the writes to be completed but does not force a flush.

The idea is to ensure that a program (which includes cp, cat, awk and many more) only returns when the data is safely on the disk, so that if the system crashes and a batch process (managed remotely) restarts, it has a clear restart.

Anonymous said...

Does this need a new zpool (zpool upgrade -v) / zfs (zfs upgrade -v) version ?

milek said...

no, it doesn't need zpool/zfs upgrade.

gil said...

Mike, I'm building my first production ZFS NAS and it is suffering from the infamous slow write speeds over NFS. After a lot of research, I decided to take my chances and run with zil_disable. I understand that ZFS is safe, but the client may suffer corruption.

However, is there a way to configure ZFS so that ZIL cache is ON and NFS sync is ignored?

Unfortunately, in my production environment, VMware's implementation of NFS client includes O_SYNC. Here is a quick run down of my write speeds over 1 gigabit ethernet:

scp write speed: 55 MB/sec
NFS write w/sync: 12 MB/sec
NFS write w/async: 28 MB/sec zil_disable

--Gil Vidals / VM Racks

geppi said...

From the description of the "sync=standard" setting I would expect that when using an SSD as a dedicated log device for the intent log, all data will be flushed to stable NAND Flash memory before the operation is committed.
If this is the case, I wonder whether the SSD is required to provide a super capacitor or any other means to ensure that data in its volatile DRAM buffer is written to stable NAND Flash memory in case of a power loss.
As far as I understand the synchronous writes implementation in ZFS, any SSD that honors the flush command should be OK. Or am I missing something? There are a lot of sources on the internet that say that an SSD which is used as a ZIL device should have a super capacitor.

milek said...

geppi - if an SSD does indeed obey the flush command then yes, it should be fine running without a super capacitor. On the other hand, devices with a super capacitor might be faster, as they do not have to obey the flush commands and can therefore hide latency behind DRAM and behind aggregating multiple writes into larger I/O.