Tuesday, July 22, 2014

Massive msync() speed up on ZFS

MongoDB uses mmap() to access all of its files. It also has a dedicated thread which by default wakes up every 60s and calls msync() on every mmap'ed file, one at a time. When you first start MongoDB all these msync() calls are fast, assuming there are no modifications. However, if your server has hundreds of GBs of RAM (or more) and the database is just as big, the msync() calls get slower over time: the more data is cached in RAM, the longer they take. Eventually it can take 50s or more for the thread to finish syncing all of the files, even if there is nothing to write to disk. The real problem is that the thread holds a global lock while it syncs the files, and until it finishes the database is almost useless. If it takes 50s to sync all the files, the database can process requests for only 10s out of each 60s window...

If you have logging enabled in MongoDB you should see log entries like:

    Tue Apr 29 06:22:02.901 [DataFileSync] flushing mmaps took 33302ms for 376 files

On Linux this is much faster, as it has a special optimization for this case which Solaris lacked.
However, Oracle fixed the bug some time ago, and the same database now reports:

    Tue Apr 29 12:55:51.009 [DataFileSync] flushing mmaps took 9ms for 378 files

This is an over 3,000x improvement (33302ms down to 9ms)!
The Solaris Bug ID is 18658199: Speed up msync() on ZFS by 90000x with this one weird trick
which is fixed in Solaris 11.1 SRU21 and also in Solaris 11.2.

Note that the fix only improves the case where an entire file is msync'ed and the underlying file system is ZFS. Any application with similar behavior would benefit. For some large mappings, like 1TB, the improvement can even be 178,000x.


Anonymous said...

Does FreeBSD have such an optimization?

milek said...

I have no idea, but it is very easy to check. Create two files, one 10240GB in size and the other 1000GB in size, then compile the program below, run it against each file with a size of 1000 (in GB) and compare the timings. With the first file only part of the file is msync'ed; with the second the entire file is.
If FreeBSD is fast in both cases, see whether the timings differ with the size of the msync'ed region.

# mkfile -n 10240g /var/tmp/file1
# mkfile -n 1000g /var/tmp/file2

# truss -t memcntl -E 2>&1 ./mmap_file_and_msync_it /var/tmp/file1 1000 | tail -1
18.2737 memcntl() = 0
# truss -t memcntl -E 2>&1 ./mmap_file_and_msync_it /var/tmp/file2 1000 | tail -1
0.0001 memcntl() = 0

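#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(int argc, char **argv) {
    int fd = open(argv[1], O_RDONLY);                           /* argv[1]: file to map */
    size_t size = (size_t)atoll(argv[2]) * 1024 * 1024 * 1024;  /* argv[2]: size in GB */
    void *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, (off_t) 0);
    msync(p, size, MS_SYNC);
}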