Robert Milkowski's blog

Tuesday, May 04, 2010

ZFS - synchronous vs. asynchronous IO

Sometimes it is very useful to be able to disable the synchronous behavior of a filesystem. Unfortunately, not all applications provide such functionality. With UFS, many people used fastfs from time to time; the problem is that it can potentially lead to filesystem corruption. With ZFS, many people have been using the undocumented zil_disable tunable. While it can cause data corruption from an application's point of view, it doesn't affect ZFS on-disk consistency. This makes the feature very useful: the risk is much smaller, yet it can greatly improve performance in cases such as database imports, NFS servers, etc. The problem with the tunable is that it is unsupported, has a server-wide impact, and affects only newly mounted ZFS filesystems, while it takes effect instantly on zvols.

From time to time there were requests to get it implemented properly, in a fully supported way. I thought it might be a good opportunity to refresh my understanding of OpenSolaris and ZFS internals, so a couple of months ago I decided to implement it under: 6280630 zil synchronicity.

And it was fun - I really enjoyed it. I spent most of the time trying to understand the interactions between the ZIL/VNODE/VFS layers and the structure of the ZFS code. I was already familiar with it to some extent, as I contributed code to ZFS in the past and I also read the code from time to time when I do some performance tuning, etc. Once I understood what was going on there, it was really easy to do the actual coding. Once I got the basic functionality working, I asked for a sponsor so it could be integrated. Tim Haley offered to sponsor me and help me get it integrated. A couple of months later, after a PSARC case, code reviews, email exchanges and testing, it finally got integrated and should appear in build 140.

I would like to thank Tim Haley, Mark Musante and Neil Perrin for all their comments, code reviews, testing, PSARC case handling, etc. It was a real pleasure to work with you.

PSARC/2010/108 zil synchronicity

ZFS datasets now have a new 'sync' property to control synchronous behavior.
The zil_disable tunable to turn synchronous requests into asynchronous requests (disable the ZIL) has been removed. For systems that use that switch on upgrade you will now see a message on booting:


sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property.
Here is a summary of the property:

-------

The options and semantics for the zfs sync property:


sync=standard
  This is the default option. Synchronous file system transactions
  (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
  and then secondly all devices written are flushed to ensure
  the data is stable (not cached by device controllers).

sync=always
  For the ultra-cautious, every file system transaction is
  written and flushed to stable storage by a system call return.
  This obviously has a big performance penalty.

sync=disabled
  Synchronous requests are disabled.  File system transactions
  only commit to stable storage on the next DMU transaction group
  commit which can be many seconds.  This option gives the
  highest performance.  However, it is very dangerous as ZFS
  is ignoring the synchronous transaction demands of
  applications such as databases or NFS.
  Setting sync=disabled on the currently active root or /var
  file system may result in out-of-spec behavior, application data
  loss and increased vulnerability to replay attacks.
  This option does *NOT* affect ZFS on-disk consistency.
  Administrators should only use this when these risks are understood.

The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command. For example:


 # zfs create -o sync=disabled whirlpool/milek
 # zfs set sync=always whirlpool/perrin
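
To make the property more concrete from an application's side, here is a minimal sketch in C of the kind of synchronous request described above: a write followed by fsync(). With sync=standard the fsync() call should not return until the data has been written to the intent log; with sync=disabled the data would only reach stable storage on the next transaction group commit. The helper name write_durably is made up for illustration and is not part of ZFS.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write buf to path and ask the filesystem to make it stable.
 * Returns 0 on success, -1 on any failure.
 * (write_durably is a hypothetical helper for illustration.)
 */
int
write_durably(const char *path, const char *buf)
{
    size_t len;
    int fd;

    assert(path != NULL && buf != NULL);
    len = strlen(buf);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return (-1);

    /* A short write is treated as a failure to keep the sketch small. */
    if (write(fd, buf, len) != (ssize_t)len) {
        (void) close(fd);
        return (-1);
    }

    /* This is the synchronous request that the sync property governs. */
    if (fsync(fd) == -1) {
        (void) close(fd);
        return (-1);
    }

    return (close(fd));
}
```

An application that calls something like write_durably() and then tells a client "your data is safe" is exactly the kind of application whose guarantees are weakened by sync=disabled.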

Have fun!

Thursday, April 29, 2010

Gartner on Oracle/Sun New Support Model

"Oracle announces its new Sun support policy that has the potential to radically change the way in which OEMs offer support and may make third-party maintenance offerings for Sun hardware unprofitable."

Tuesday, April 27, 2010

Memory DeDuplication in Linux

Saturday, April 24, 2010

Oracle Solaris on HP ProLiant Servers

Recently there has been lots of confusion regarding running Solaris 10 on non-Sun servers.

HP Oracle Solaris 10 Subscriptions and Support:

"Certifying Oracle Solaris on ProLiant servers since 1996, HP is expanding its relationship with Oracle to include selling Oracle Solaris 10 Operating System Subscriptions and support from HP Technology Services on certified ProLiant servers.

HP will provide the subscriptions and support for the Oracle Solaris 10 Operating System on certified ProLiant servers and Oracle will provide patches and updates directly to HP's customers through Oracle SunSolve.

As part of this expanded relationship HP and Oracle will work together to enhance the customer experience for Oracle Solaris on ProLiant servers and HP increase its participation in the OpenSolaris community."

And of course you can also run OpenSolaris on any x86 hardware, including HP servers, entirely for free if you want. I wonder, though, if it would make sense for HP to also offer support for OpenSolaris - more and more customers are deploying OpenSolaris instead of Solaris 10 on their servers, and Oracle already offers support for it on their own servers.

Monday, March 29, 2010

ZFS diff

PSARC/2010/105:


        There is a long-standing RFE for zfs to be able to describe
        what has changed between the snapshots of a dataset.
        To provide this capability, we propose a new 'zfs diff'
        sub-command.  When run with appropriate privilege the
        sub-command describes what file system level changes have
        occurred between the requested snapshots.  A diff between the
        current version of the file system and one of its snapshots is
        also supported.

        Five types of change are described:

        o    File/Directory modified
        o    File/Directory present in older snapshot but not newer
        o    File/Directory present in newer snapshot but not older
        o    File/Directory renamed
        o    File link count changed

      zfs diff snapshot  snapshot | filesystem

         Gives a high level description of the differences between a
         snapshot and a descendant dataset.  The descendant may either
         be a later snapshot of the dataset or the current dataset.
         For each file system object that has undergone a change
         between the original snapshot and the descendant, the type of
         change is described along with the name of the file or
         directory.  In the case of a rename, both the old and new
         names are shown.

         The type of change is described with a single character:

         +   Indicates the file/directory was added in the later dataset
         -   Indicates the file/directory was removed in the later dataset
         M   Indicates the file/directory was modified in the later dataset
         R   Indicates the file/directory was renamed in the later dataset

        If the modification involved a change in the link count of a
        file, the change will be expressed as a delta within
        parentheses on the modification line.  Example outputs are
        below:

         M       /myfiles/
         M       /myfiles/link_to_me   (+1)
         R       /myfiles/rename_me -> /myfiles/renamed
         -       /myfiles/delete_me
         +       /myfiles/new_file
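
Because every output line starts with a single change character, the format is easy to consume from a program. Below is a minimal sketch in C that parses one line in the layout shown above. The struct and function names are made up for illustration, the format is the one from the PSARC proposal (the shipped command may differ), and paths containing whitespace are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_MAX_LEN 1024  /* arbitrary limit for this sketch */

struct diff_entry {
    char type;                    /* '+', '-', 'M' or 'R' */
    char path[PATH_MAX_LEN];      /* path (old path for a rename) */
    char new_path[PATH_MAX_LEN];  /* new path, set only for 'R' */
    int  link_delta;              /* link count change, 0 if none */
};

/* Parse one line of output; returns 0 on success, -1 if malformed. */
int
parse_diff_line(const char *line, struct diff_entry *e)
{
    char a[PATH_MAX_LEN], b[PATH_MAX_LEN], c[PATH_MAX_LEN];
    int n;

    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof (*e));
    a[0] = b[0] = c[0] = '\0';

    /* 1023 = PATH_MAX_LEN - 1, leaving room for the terminating NUL. */
    n = sscanf(line, " %c %1023s %1023s %1023s", &e->type, a, b, c);
    if (n < 2 || strchr("+-MR", e->type) == NULL)
        return (-1);

    snprintf(e->path, sizeof (e->path), "%s", a);

    if (e->type == 'R') {
        /* Renames must look like: R old -> new */
        if (n != 4 || strcmp(b, "->") != 0)
            return (-1);
        snprintf(e->new_path, sizeof (e->new_path), "%s", c);
    } else if (n == 3) {
        /* Optional link count delta such as (+1) or (-2). */
        if (sscanf(b, "(%d)", &e->link_delta) != 1)
            return (-1);
    } else if (n != 2) {
        return (-1);
    }

    return (0);
}
```

Feeding it the example line for /myfiles/link_to_me should yield type 'M' with a link_delta of 1, and the rename line should fill in both path and new_path.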

Saturday, March 27, 2010

Project Brussels Phase II

This project introduces a new CLI utility called ipadm(1M) that can be used to perform:

* IP interfaces management (creation/deletion)
* IP address management (add, delete, show) for static IPv4 & IPv6 addresses, DHCP, stateless/stateful IPv6 address configuration
* protocol (IP/TCP/UDP/SCTP/ICMP) tunable management (set, get, reset) global (ndd(1M)) tunables, as well as per-interface tunables
* provide persistence for all of the three features above so that on reboot the configuration is reapplied

Please see the case materials of PSARC 2010/080 for the latest design document and read ipadm(1M) man page for more information.

This has been integrated into OpenSolaris.

Friday, March 26, 2010

CPU/MEM HotPlug on x86 in OpenSolaris

The integration of:

 PSARC/2009/104 Hot-Plug Support for ACPI-based Systems
 PSARC/2009/550 PSMI extensions for CPU Hotplug
 PSARC/2009/551 acpihpd ACPI Hotplug Daemon
 PSARC/2009/591 Attachment Points for Hotpluggable x86 systems
 6862510 provide support for cpu hot add on x86
 6874842 provide support for memory hot add on x86
 6883891 cmi interface needs to support dynamic reconfiguration
 6884154 x2APIC and kmdb may not function properly during CPU hotplug event.
 6904971 low priority acpi nexus code review feedback
 6877301 lgrp should support memory hotplug flag in SRAT table

Introduces support for hot-adding cpus and memory to a running Xeon 7500 platform.

Sunday, February 28, 2010

ReadyBoost

I didn't know that Windows has a technology similar to ZFS L2ARC, called ReadyBoost. Nice.

I'm building my new home NAS server and I'm seriously considering putting the OS on a USB pen drive, leaving all SATA disks for data only. It looks like with modern USB drives the OS should actually boot faster than from SATA disks, thanks to much better seek times. I'm planning on doing some experiments first.

Thursday, February 25, 2010

ZVOLs - Write Cache

When you create a ZFS volume, its write cache is disabled by default, meaning that all writes to the volume will be synchronous. Sometimes, though, it might be handy to be able to enable the write cache for a particular zvol. I wrote a small C program which allows you to check whether the write cache (WC) is enabled. It also allows you to enable or disable the write cache for a specified zvol.

First let's check if the write cache is disabled for the zvol rpool/iscsi/vol1:


milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1
Write Cache: disabled

Now let's issue 1000 writes:


milek@r600:~/progs# ptime ./sync_file_create_loop /dev/zvol/rdsk/rpool/iscsi/vol1 1000

real       12.013566363
user        0.003144874
sys         0.104826470

So it took 12s, and I also confirmed that the writes were actually being issued to a disk drive. Let's enable the write cache now and repeat the 1000 writes:


milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1 1
milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1
Write Cache: enabled

milek@r600:~/progs# ptime ./sync_file_create_loop /dev/zvol/rdsk/rpool/iscsi/vol1 1000

real        0.239360231
user        0.000949655
sys         0.019019552

Worked fine.

The zvol_wce program is not idiot-proof and it doesn't check whether the operation succeeded or not. You should be able to compile it by issuing: gcc -o zvol_wce zvol_wce.c


milek@r600:~/progs# cat zvol_wce.c

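/* Robert Milkowski
  http://milek.blogspot.com
*/

#include <stdio.h>      /* printf, fprintf */
#include <stdlib.h>     /* atoi, exit */
#include <unistd.h>
#include <stropts.h>    /* ioctl */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/dkio.h>   /* DKIOCGETWCE, DKIOCSETWCE */


int main(int argc, char **argv)
{
 char *path;
 int wce = 0;
 int rc;   /* ioctl result, deliberately not checked */
 int fd;

 if (argc < 2) {
   fprintf(stderr, "usage: %s zvol-device [0|1]\n", argv[0]);
   exit(1);
 }

 path = argv[1];

 if ((fd = open(path, O_RDONLY|O_LARGEFILE)) == -1)
   exit(2);

 if (argc>2) {
   /* set: any non-zero second argument enables the write cache */
   wce = atoi(argv[2]) ? 1 : 0;
   rc = ioctl(fd, DKIOCSETWCE, &wce);
 }
 else {
   /* get: query and print the current write cache state */
   rc = ioctl(fd, DKIOCGETWCE, &wce);
   printf("Write Cache: %s\n", wce ? "enabled" : "disabled");
 }

 close(fd);
 exit(0);
}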

Tuesday, February 16, 2010

60 Disks in 4U

Friday, February 12, 2010

Chip and PIN is Broken

Wednesday, February 10, 2010

Dell - No 3rd Party Disk Drives Allowed

Third-party drives not permitted:

"[...]
Is Dell preventing the use of 3rd-party HDDs now?
[....]
Howard_Shoobe at Dell.com:
Thank you very much for your comments and feedback regarding exclusive use of Dell drives. It is common practice in enterprise storage solutions to limit drive support to only those drives which have been qualified by the vendor. In the case of Dell's PERC RAID controllers, we began informing customers when a non-Dell drive was detected with the introduction of PERC5 RAID controllers in early 2006. With the introduction of the PERC H700/H800 controllers, we began enabling only the use of Dell qualified drives. There are a number of benefits for using Dell qualified drives in particular ensuring a positive experience and protecting our data. While SAS and SATA are industry standards there are differences which occur in implementation. An analogy is that English is spoken in the UK, US and Australia. While the language is generally the same, there are subtle differences in word usage which can lead to confusion. This exists in storage subsystems as well. As these subsystems become more capable, faster and more complex, these differences in implementation can have greater impact. Benefits of Dell's Hard Disk and SSD drives are outlined in a white paper on Dell's web site at http://www.dell.com/downloads/global/products/pvaul/en/dell-hard-drives-pov.pdf"

I understand they won't support 3rd party disk drives but blocking a server (a RAID card) from using such disks is something new - an interesting comment here.

IBM 2010: Customers in Revolt

From my own experience, their sales people are very aggressive, with an attitude of sell first and let someone else worry later. While I always take any vendor's claims with a grain of salt, I have learnt to double or even triple check any of IBM's claims.

Tuesday, February 09, 2010

Power 7

Now let's wait for some benchmarks. I only wish Solaris was running on them as well, as right now you need to go the legacy AIX route or the not so mature Linux route - not an ideal choice.

Thursday, February 04, 2010

Data Corruption - ZFS saves the day, again

We came across an interesting issue with data corruption and I think it might be interesting to some of you. While preparing a new cluster deployment and filling it up with data, we suddenly started to see the messages below:


XXX cl_runtime: [ID 856360 kern.warning] WARNING: QUORUM_GENERIC: quorum_read_keys error:
 Reading the registration keys failed on quorum device /dev/did/rdsk/d7s2 with error 22.

The d7 quorum device was marked as being offline and we could not bring it online again. There isn't much in the documentation about the above message, except that it is probably a firmware problem on a disk array and we should contact the vendor. But let's investigate first what is really going on.

By looking at the source code I found that the above message is printed from within quorum_device_generic_impl::quorum_read_keys(), and it will only happen if quorum_pgre_key_read() returns with return code 22 (actually any value other than 0 or EACCES, but from the syslog message we already suspect that the return code is 22).

The quorum_pgre_key_read() calls quorum_scsi_sector_read() and passes its return code as its own. The quorum_scsi_sector_read() will return with an error only if quorum_ioctl_with_retries() returns with an error or if there is a checksum mismatch.

This is the relevant source code:


int
quorum_scsi_sector_read(
[...]
    error = quorum_ioctl_with_retries(vnode_ptr, USCSICMD, (intptr_t)&ucmd,
        &retval);
    if (error != 0) {
        CMM_TRACE(("quorum_scsi_sector_read: ioctl USCSICMD "
            "returned error (%d).\n", error));
        kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
        return (error);
    }

    //
    // Calculate and compare the checksum if check_data is true.
    // Also, validate the pgres_id string at the beg of the sector.
    //
    if (check_data) {
        PGRE_CALCCHKSUM(chksum, sector, iptr);

        // Compare the checksum.
        if (PGRE_GETCHKSUM(sector) != chksum) {
            CMM_TRACE(("quorum_scsi_sector_read: "
                "checksum mismatch.\n"));
            kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
            return (EINVAL);
        }

        //
        // Validate the PGRE string at the beg of the sector.
        // It should contain PGRE_ID_LEAD_STRING[1|2].
        //
        if ((os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING1,
            strlen(PGRE_ID_LEAD_STRING1)) != 0) &&
            (os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING2,
            strlen(PGRE_ID_LEAD_STRING2)) != 0)) {
            CMM_TRACE(("quorum_scsi_sector_read: pgre id "
                "mismatch. The sector id is %s.\n",
                sector->pgres_id));
            kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
            return (EINVAL);
        }

    }
    kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);

    return (error);
}
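
The checksum comparison in the listing above follows a common pattern: recompute a checksum over the sector that was just read and return EINVAL (errno 22 on Solaris and Linux) if it differs from the one stored in the sector. Here is a stripped-down sketch of that pattern in C. The sector layout and the XOR checksum are stand-ins chosen for illustration; the real PGRE_CALCCHKSUM macro and sector structure are not shown here and are very likely different.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_WORDS 127  /* hypothetical: 127 data words + 1 checksum word */

/* Hypothetical sector layout, NOT the Sun Cluster structure. */
struct demo_sector {
    uint32_t data[SECTOR_WORDS];
    uint32_t chksum;
};

/* Stand-in checksum: XOR of all data words. */
uint32_t
demo_calc_chksum(const struct demo_sector *s)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < SECTOR_WORDS; i++)
        sum ^= s->data[i];
    return (sum);
}

/* Returns 0 if the stored checksum matches, EINVAL otherwise. */
int
demo_sector_check(const struct demo_sector *s)
{
    assert(s != NULL);
    if (s->chksum != demo_calc_chksum(s))
        return (EINVAL);
    return (0);
}
```

With a checksum like this one, a single flipped bit in the data words is enough to make the check fail with error 22, which is the error number in the syslog message above.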

With a simple DTrace script I could verify whether quorum_scsi_sector_read() does indeed return 22, and I could also print what else is going on within the function:


56  -> __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555744942019 enter
56    -> __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555744957176 enter
56    <- __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555745089857 rc: 0 
56    -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745108310 enter
56      -> __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745120941 enter
56        -> __1cCosHsprintf6FpcpkcE_v_      6308555745134231 enter
56        <- __1cCosHsprintf6FpcpkcE_v_      6308555745148729 rc: 2890607504684 
56      <- __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745162898 rc: 1886718112 
56    <- __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745175529 rc: 1886718112 
56  <- __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555745188599 rc: 22

From the above output we know that quorum_ioctl_with_retries() returns 0, so it must be a checksum mismatch! As CMM_TRACE() is being called above and there are only three of them in the code, let's check with DTrace which one it is:


21  -> __1cNdbg_print_bufIdbprintf6MpcE_v_   6309628794339298 quorum_scsi_sector_read: checksum mismatch.

So now I knew exactly which part of the code was causing the quorum device to be marked offline. The issue might have been caused by many things: a bug in the disk array firmware, a problem on the SAN, a bug in an HBA's firmware, a bug in the qlc driver, a bug in the SC software, or... However, because the issue suggests data corruption and we are loading the cluster with a copy of a database, we might have a bigger issue than just an offline quorum device.

The configuration is such that we are using ZFS to mirror between two disk arrays. We had been restoring a couple of TBs of data into the pool and had read almost nothing back. Thankfully it is ZFS, so we can force a re-check of all data in the pool (a scrub), and I did. ZFS found 14 corrupted blocks and even identified which file was affected. The interesting thing here is that for all of these blocks both copies, on both sides of the mirror, were affected. This almost eliminates the possibility of a firmware problem on the disk arrays and suggests that the issue was caused by something misbehaving on the host itself. There is still a possibility of an issue on the SAN as well. It is very unlikely to be a bug in ZFS, as the corruption also affected the reservation keys, which have basically nothing to do with ZFS.

Since then we have kept writing more and more data into the pool and I keep repeating scrubs, but I'm not getting any new corrupted blocks, nor is the quorum device misbehaving (I fixed it by temporarily adding another quorum device, removing the original one, re-adding it, and then removing the temporary one).

While I still have to find what caused the data corruption, the most important thing here is ZFS. Just think about it: what would have happened if we were running on any other file system, like UFS, VxFS, ext3, ext4, JFS, XFS, ...? Well, almost anything could have happened: some data could be corrupted, some files lost, the system could crash, or fsck could be forced to run for many hours and still not be able to fix the filesystem, and it definitely wouldn't be able to detect any data corruption within files. Or everything would be running fine for days or months and then suddenly the system would panic when an application tried to access the corrupted blocks for the first time.

Thanks to ZFS, what has actually happened? All corrupted blocks were identified. Unfortunately both mirrored copies were affected, so ZFS can't fix them, but it did identify the single file affected by all of these blocks. We can just remove the file, which is only 2GB, and restore it again. And all of this happened while the system was running; we didn't have to stop the restore or start it from the beginning. Most importantly, there is no uncertainty about the state of the filesystem or the data within it.

The other important conclusion is that DTrace is a sysadmin's best friend :)