Thursday, December 02, 2010

Linux, O_SYNC and Write Barriers

We all love Linux, but sometimes it is better not to look under its hood, as you never know what you might find.

I stumbled across a very interesting discussion on a Linux kernel mailing list. It is dated August 2009 so you may have already read it.

There is a related RH bug.

I'm a little bit surprised by RH's attitude in this ticket. IMHO they should have fixed it, and maybe provided a tunable to enable or disable the new behavior, instead of keeping the broken implementation. But at least the recent man pages clarify it in the Notes section of open(2):
"POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."
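
A quick way to see what your own system does is to compare the two constants and open a file with O_DSYNC. The snippet below is only a minimal sketch and comes from neither the man page nor the mailing list thread: the helper names sync_flags_are_distinct() and write_dsync() are made up for illustration, and the path is whatever you pass in. It says nothing about what the file system underneath actually does with the flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the C library headers give O_SYNC and
 * O_DSYNC different numerical values, 0 if they are aliases of each other. */
int sync_flags_are_distinct(void)
{
    return O_SYNC != O_DSYNC;
}

/* Hypothetical helper: writes len bytes to path through a descriptor opened
 * with O_DSYNC, so each write() asks for synchronized data integrity.
 * Returns 0 on success, -1 on any failure. */
int write_dsync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, buf, len);
    int rc = (written == (ssize_t)len) ? 0 : -1;

    /* A failed close() can also report a write error, so check it. */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Calling sync_flags_are_distinct() only tells you whether the headers you compiled against separate the two flags. Whether the kernel and the file system honour the difference is a separate question.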

Then there is another even more interesting discussion about write barriers:
"All of them fail to commit drive caches under some circumstances;
even fsync on ext3 with barriers enabled (because it doesn't
commit a journal record if there were writes but no inode change
with data=ordered)."
and also this one:
"No, fsync() doesn't always flush the drive's write cache. It often
does, and I think many people are under the impression it always does, but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);

while (1) {
    char byte;
    usleep (100000);
    pwrite (fd, &byte, 1, 0);
    fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more. That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals. A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance. I'm not sure if ordered requests are actually implemented
by any drivers at the moment. If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it sucks performance for database-like applications and virtual machines, a lot due to unnecessary seeks. That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need."
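
If you want to try something along the lines of the quoted loop yourself, here is a bounded variant as a minimal sketch. It is not the code from the mailing list: the function name time_synced_writes() and its parameters are made up for illustration, the usleep() delay is dropped, and it can call either fsync() or fdatasync() so the two can be compared. It only measures elapsed time, which hints at, but does not prove, whether the drive cache was flushed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: overwrites byte 0 of path `iterations` times and
 * syncs after every write, using fdatasync() if use_fdatasync is non-zero
 * and fsync() otherwise. Returns elapsed seconds, or -1.0 on error. */
double time_synced_writes(const char *path, int iterations, int use_fdatasync)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1.0;

    struct timespec start, end;
    char byte = 'x';          /* initialized, unlike the quoted snippet */
    double elapsed = -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        if (pwrite(fd, &byte, 1, 0) != 1)
            goto out;
        if ((use_fdatasync ? fdatasync(fd) : fsync(fd)) != 0)
            goto out;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed = (double)(end.tv_sec - start.tv_sec)
            + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
out:
    close(fd);
    return elapsed;
}
```

For example, time_synced_writes("test_file", 100, 0) times 100 write+fsync() rounds, and passing 1 as the last argument switches to fdatasync(). To see the actual flush operations the quote talks about you would still need to watch the block layer with a tracing tool while this runs.
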
This is really scary. I wonder how many developers knew about it, especially those coding for Linux when data safety was paramount. Sometimes it feels as if some Linux developers are coding to win benchmarks and do not necessarily care about data safety, correctness and standards like POSIX. What is even worse is that some of them don't even bother to tell you about it in the official documentation (at least the O_SYNC/O_DSYNC issue is documented in the man page now).


10 comments:

Anonymous said...

You are a well-known blind Solaris preacher ;)

http://www.phoronix.com/scan.php?page=article&item=nexenta_30_perf&num=1

milek said...

Do you actually have anything to say regarding the entry or are you just trolling Mr. Anonymous?

And yes, insecure implementations might offer better performance :)

Anonymous said...

This is just what preachers usually do. Fewer facts, more myths...

Waste of time :(

trasz said...

Well, citing Phoronix usually means that the comment author is either a troll or a kid.

Stefan Parvu said...

Hi,

2.6.33+ adds support for O_DSYNC. O_RSYNC will probably be supported in future versions of Linux.

From "The Linux Programming Interface" book, page 243:

"Starting with kernel 2.6.33, Linux implements O_DSYNC...

I would not call this scary, but simply slow progress in certain areas. Linux is a mature open source project which tries to improve over time. It differs from Solaris and AIX, where corporations simply put in $$$ and look for revenue and profits. We have all seen the results of this equation with Sun.

milek said...

Stefan - I think that you haven't really carefully read the blog entry and discussions it refers to.

Stefan Parvu said...

From your blog, and from Linux's open(2):

"POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC.

As you noted, this was true for kernel 2.6.31. However, O_DSYNC is fixed in 2.6.33+.

O_RSYNC will also be available in future kernel versions.

milek said...

Stefan - I must say I haven't looked closely at 2.6.33+ yet. Although there are separate O_DSYNC and O_SYNC flags now, I don't think that O_SYNC is actually implemented by most (any?) Linux file systems.

Then see the 2nd part of my blog entry - this is the really scary thing.

Also, regarding your previous comment - it is not about slower progress, as sometimes that is true and sometimes it is not. It is more about correctness and honest documentation. Many people, including me, thought that write barriers, fsync(), etc. had been implemented correctly and did what they are supposed to do and what is documented. Now, I don't mind bugs... they happen to all software. The scary thing here is that it is not a bug, it is considered a feature. If I use a proper API to request synchronous semantics, I expect an OS to obey it, even if it means it is slower. After all, if I didn't care about data I wouldn't use synchronous I/O. I would definitely not want an OS to do any "optimizations" to make things faster if that potentially risks data and lies to an application.

Linuxhippy said...

> The scary thing here is that it is not
> a bug, it is considered a feature

It is considered a bug, and that's why it is being worked on.

fsync() behaviour is a bit messy on all OSes; take a look at the discussions that took/take place e.g. on the PostgreSQL development mailing lists.

- Clemens

Anonymous said...

See here what Ted Ts'o, creator of ext4, says:

http://phoronix.com/forums/showthread.php?36507-Large-HDD-SSD-Linux-2.6.38-File-System-Comparison&p=181904#post181904