Friday, November 09, 2012

vmtasks explained

Solaris 11 introduced a new kernel process called "vmtasks" which accelerates some operations when working with shared memory. For more details see here.

20 Years of Solaris

Nice video from Oracle celebrating 20 years of Solaris.

Tuesday, October 23, 2012

Running OpenAFS on Solaris 11 x86 / ZFS

Recently I gave a talk on running OpenAFS services on top of Solaris 11 x86 / ZFS. The talk was split into two parts. The first part covered the $$ benefits of transparent ZFS compression when running on 3rd-party x86 hardware (it also makes sense when running on Sun/Oracle kit - in some cases even more so). This part also discusses some ideas about running AFS on internal disks instead of directly attached disk arrays, which, again thanks to ZFS built-in compression, is worthwhile and delivers even more $$ savings.

The main message of this part is that if your data compresses well (above 2x), running OpenAFS on ZFS can deliver similar or even better performance, but most importantly it can save you lots of $$, both in acquisition costs and in the cost of running an AFS plant. In most cases you should even be able to re-use the x86 hardware you already have. The beauty of AFS is that we were able to migrate data from Linux to Solaris/ZFS in-place, re-using the same x86 HW, and all of this was completely transparent to all clients (keep in mind we are talking about PBs of data) - this is truly the cloud file system. I think OpenAFS is one of the under-appreciated technologies in the market.
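For reference, enabling compression and checking how well your data actually compresses is a one-property affair (the dataset name and the ratio shown here are hypothetical, just to illustrate):
# zfs set compression=on pool/afs
# zfs get compressratio pool/afs
NAME      PROPERTY       VALUE  SOURCE
pool/afs  compressratio  2.88x  -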

The second part is about using DTrace, both on dev and production systems, to find scalability and performance bottlenecks, as well as other bugs. Two easy, real-life examples are discussed which resulted in considerable improvements in the scalability and performance of some operations in OpenAFS, along with some other examples of D scripts which provide top-like output with statistics (slide #32 is an example from a Solaris NFS server serving VMware clients, displaying different stats per VM from a single file system...). DTrace has proven to be a very powerful and helpful tool for us, although it is hard to put a specific $ value on what it brings.
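To give a flavor of the kind of one-liner such scripts grow out of (this is not one of the scripts from the talk, just the classic starting point), counting syscalls per executable system-wide is as simple as:
# dtrace -n 'syscall:::entry { @[execname] = count(); }'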

The slides should be available here.

Wednesday, August 29, 2012

OpenIndiana is dead

With the main guy behind the project resigning, OI is essentially dead. It's been dead for some time, though, and with no commercial backing it never really had much of a chance. This is sad news indeed (although I haven't really used OI). It marks the end of the OpenSolaris era.

Can Illumos survive in the long term? Can it become relevant outside of a couple of niche use cases?

Ironically, it is Oracle's Solaris which will probably outlive all of them.

Friday, July 27, 2012

Locking a running process in RAM

Recently I was looking at the possibility of locking some or all memory mappings of an already running process in RAM. Why would you want to do that? There might be many reasons. One of them is to prevent a critical process from being swapped out.

Now, how can we do it without restarting the process or changing its code? There are memcntl() and similar calls like plock(), mlock(), etc. - but they only affect the process calling them. However, libproc on Solaris allows you to run memcntl() in the context of another process, among many other cool things. By the way - libproc is used by tools like truss, ppgsz, etc.
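For comparison, here is a minimal sketch of a process locking its own address space with mlockall() - the point being that it only affects the caller (and requires the proc_lock_memory privilege, as we will see below):

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  /* Lock all current and future mappings of *this* process only. */
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    perror("mlockall() failed");
    exit(1);
  }
  /* ... do the critical work that must not be paged out ... */
  return 0;
}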

First, let's see that it actually works. Here is a bash process running as a non-root user; notice that no mappings are locked in RAM:
cwafseng3 $ pmap -ax $$
27709:  bash
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08050000     968     968       -       - r-x--  bash
08151000      76      76       8       - rwx--  bash
08164000     140     140      56       - rwx--    [ heap ]
F05D0000     440     440       -       - r-x--  libnsl.so.1
F064E000       8       8       4       - rw---  libnsl.so.1
F0650000      20      12       -       - rw---  libnsl.so.1
F0660000      56      56       -       - r-x--  libsocket.so.1
F067E000       4       4       -       - rw---  libsocket.so.1
FE560000      64      16       -       - rwx--    [ anon ]
FE577000       4       4       4       - rwxs-    [ anon ]
FE580000      24      12       4       - rwx--    [ anon ]
FE590000       4       4       4       - rw---    [ anon ]
FE5A0000    1352    1352       -       - r-x--  libc_hwcap1.so.1
FE702000      44      44      16       - rwx--  libc_hwcap1.so.1
FE70D000       4       4       -       - rwx--  libc_hwcap1.so.1
FE710000       4       4       -       - r-x--  libdl.so.1
FE720000       4       4       4       - rw---    [ anon ]
FE730000     184     184       -       - r-x--  libcurses.so.1
FE76E000      16      16       -       - rw---  libcurses.so.1
FE772000       8       8       -       - rw---  libcurses.so.1
FE780000       4       4       4       - rw---    [ anon ]
FE790000       4       4       4       - rw---    [ anon ]
FE7A0000       4       4       -       - rw---    [ anon ]
FE7AD000       4       4       -       - r--s-    [ anon ]
FE7B4000     220     220       -       - r-x--  ld.so.1
FE7FB000       8       8       4       - rwx--  ld.so.1
FE7FD000       4       4       -       - rwx--  ld.so.1
FEFFB000      16      16       4       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    3688    3620     116       -
Now I will use a small tool I wrote to lock all mappings with RX or RWX protections. The tool requires a PID to be specified.
$ ./pr_memcntl 27709
pr_memcntl() failed: Not owner
Although I ran it as root, it failed. Remember when I wrote a moment ago that libproc runs memcntl() in the context of the target process? The target process here is bash with PID 27709, which is running as a standard user, and by default a standard user cannot lock pages in memory. We can add the required privilege for locking pages in RAM. Let's see what's missing. I enabled privilege debugging for the bash process and ran the pr_memcntl tool again:
$ ppriv -D 27709
bash[27709]: missing privilege "proc_lock_memory"
            (euid = 145104, syscall = 131) needed at memcntl+0x140
Now, let's add the missing privilege to the bash process and then try locking the mappings again:
$ ppriv -s EP+proc_lock_memory 27709
$ ./pr_memcntl 27709
$
This time there was no error. Let's look at the pmap output again:
$ pmap -ax $$
27709:  bash
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08050000     968     968       -     968 r-x--  bash
08151000      76      76       8      76 rwx--  bash
08164000     140     140      48     140 rwx--    [ heap ]
F05D0000     440     440       -     440 r-x--  libnsl.so.1
F064E000       8       8       4       - rw---  libnsl.so.1
F0650000      20      12       -       - rw---  libnsl.so.1
F0660000      56      56       -      56 r-x--  libsocket.so.1
F067E000       4       4       -       - rw---  libsocket.so.1
FE560000      64      64       -      64 rwx--    [ anon ]
FE577000       4       4       4       - rwxs-    [ anon ]
FE580000      24      24       4      24 rwx--    [ anon ]
FE590000       4       4       4       - rw---    [ anon ]
FE5A0000    1352    1352       -    1352 r-x--  libc_hwcap1.so.1
FE702000      44      44      16      44 rwx--  libc_hwcap1.so.1
FE70D000       4       4       -       4 rwx--  libc_hwcap1.so.1
FE710000       4       4       -       4 r-x--  libdl.so.1
FE720000       4       4       4       - rw---    [ anon ]
FE730000     184     184       -     184 r-x--  libcurses.so.1
FE76E000      16      16       -       - rw---  libcurses.so.1
FE772000       8       8       -       - rw---  libcurses.so.1
FE780000       4       4       4       - rw---    [ anon ]
FE790000       4       4       4       - rw---    [ anon ]
FE7A0000       4       4       -       - rw---    [ anon ]
FE7AD000       4       4       -       - r--s-    [ anon ]
FE7B4000     220     220       -     220 r-x--  ld.so.1
FE7FB000       8       8       4       8 rwx--  ld.so.1
FE7FD000       4       4       -       4 rwx--  ld.so.1
FEFFB000      16      16       4       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    3688    3680     108    3588
It works! :)

Notice that only mappings with RX or RWX protections are locked (as hard-coded in the tool; any mapping, or the entire address space, can be locked if desired). The tool could obviously be extended to add the required privilege automatically when needed.

The C code below is a prototype - it is not idiot-proof, it does not handle all errors, and it could be more user friendly - but it works and it is trivial to extend. It should work on Solaris 10 and Solaris 11 as well as on all Illumos-based distributions.

Notice that it sets the MCL_CURRENT|MCL_FUTURE flags, meaning that not only are the current rx|rwx mappings locked, but all future mappings with the rx|rwx protections will be locked as well. In order to compile the program you need libproc.h, which is currently not distributed with Solaris (hopefully that will change soon). You can get a copy from here.

Some ideas on how the tool could be easily extended:
  • add an option to add the proc_lock_memory privilege automatically (and perhaps update the resource limit as well)
  • add an option to remove the lock from a specific mapping or from all mappings (see the sketch after this list)
  • add an option to specify what should be locked
  • add options to specify whether only current mappings, only future mappings, or both should be locked
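For example, the unlock option could be implemented with MC_UNLOCKAS. A hypothetical sketch of the call, assuming the same libproc setup as in the tool below (per memcntl(2), arg must be 0 for MC_UNLOCKAS):

  /* Hypothetical -u handling: undo the MC_LOCKAS locks on rx|rwx mappings. */
  if (pr_memcntl(Pr, 0, 0, MC_UNLOCKAS, (caddr_t)0, PROC_TEXT, (int)0) ||
      pr_memcntl(Pr, 0, 0, MC_UNLOCKAS, (caddr_t)0, PROC_TEXT|PROT_WRITE, (int)0)) {
    perror("pr_memcntl() failed");
  }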

PS: putting the process in the RT class would achieve a similar result, although it wouldn't give control over which mappings are locked, and running in RT might not be desirable for other reasons.
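For reference, that would be something like:
# priocntl -s -c RT -i pid 27709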
 

// gcc -m64 -I. -o pr_memcntl pr_memcntl.c -lproc

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#include <libproc.h>

int main(int argc, char **argv) {
  pid_t pid;
  int perr;
  static struct ps_prochandle *Pr;

  if (argc != 2) {
    fprintf(stderr, "usage: %s pid\n", argv[0]);
    exit(1);
  }
  pid = atoi(argv[1]);

  /* Attach to the target process without stopping it. */
  if ((Pr = Pgrab(pid, PGRAB_NOSTOP, &perr)) == NULL) {
    fprintf(stderr, "Pgrab() failed: %s\n", Pgrab_error(perr));
    exit(1);
  }

  /* Lock all current and future rx mappings in the target's address
   * space (PROC_TEXT is defined as PROT_READ|PROT_EXEC in sys/mman.h). */
  if (pr_memcntl(Pr, 0, 0, MC_LOCKAS, (caddr_t)(MCL_CURRENT|MCL_FUTURE), PROC_TEXT, (int)0)) {
    perror("pr_memcntl() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  /* Same again for rwx mappings. */
  if (pr_memcntl(Pr, 0, 0, MC_LOCKAS, (caddr_t)(MCL_CURRENT|MCL_FUTURE), PROC_TEXT|PROT_WRITE, (int)0)) {
    perror("pr_memcntl() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  /* Detach, leaving the target running (with its mappings locked). */
  Prelease(Pr, 0);
  Pr = NULL;

  exit(0);
}

Wednesday, May 02, 2012

Physical disk locations

zpool(1M) has a very handy option to display physical disk locations on some hardware.
cwafseng3 $ zpool status -l cwafseng3-0
  pool: cwafseng3-0
 state: ONLINE
  scan: scrub canceled on Thu Apr 12 13:52:13 2012
config:
 
        NAME                                       STATE     READ WRITE CKSUM
        cwafseng3-0                                ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD02/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD23/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD22/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD21/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD20/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD19/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD17/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD15/disk  ONLINE       0     0     0
 
errors: No known data errors
cwafseng3 $
The HDDXX entries directly correspond to the physical slots, in this case in a Sun X4270 M2 server. Compare it to the standard output:
cwafseng3 $ zpool status cwafseng3-0
  pool: cwafseng3-0
 state: ONLINE
  scan: scrub canceled on Thu Apr 12 13:52:13 2012
config:

        NAME                       STATE     READ WRITE CKSUM
        cwafseng3-0                ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c3t5000CCA00AC87F55d0  ONLINE       0     0     0
            c3t5000CCA00AAA0D1Dd0  ONLINE       0     0     0
            c3t5000CCA00AA95559d0  ONLINE       0     0     0
            c3t5000CCA00AAAD155d0  ONLINE       0     0     0
            c3t5000CCA015214845d0  ONLINE       0     0     0
            c3t5000CCA015214F85d0  ONLINE       0     0     0
            c3t5000CCA01521070Dd0  ONLINE       0     0     0
            c3t5000CCA0151A287Dd0  ONLINE       0     0     0

errors: No known data errors
cwafseng3 $ 
See also croinfo(1M) for how to get this information (and more) for all disk drives, regardless of whether they are part of a zpool or not.

Friday, January 20, 2012

MWAC in Global Zone

Solaris 11 has a cool new feature called Immutable Zones. Darren Moffat presented the new features in Solaris 11 Zones at the last LOSUG meeting in London. Immutable Zones basically allow read-only or partially read-only zones to be deployed. You can even combine this with ZFS encryption - see Darren's blog entry for more details. The underlying technology behind immutable zones is called Mandatory Write Access Control (MWAC) and is implemented in the kernel. For each open, unlink, etc. syscall, the VFS layer checks whether MWAC is enabled for the given filesystem and zone, and if it is, it checks the white and black lists associated with the zone and potentially denies write access to a file (returning EROFS). The actual definitions of the different default profiles are located in the /usr/lib/brand/solaris/config.xml file. It is *very* simple to use the pre-defined profiles when creating a zone, and it just works. Really cool. Thanks Darren for the great demo.
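For reference, making a zone immutable is just a matter of setting the file-mac-profile property in zonecfg (the zone name here is hypothetical):
root@global # zonecfg -z myzone
zonecfg:myzone> set file-mac-profile=fixed-configuration
zonecfg:myzone> commit
zonecfg:myzone> exit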

Now, MWAC only works with non-global zones... at least by default. There is no public interface exposed to manipulate MWAC rules directly or to enable it for the global zone, but that doesn't mean one can't try to do it anyway. DTrace, objdump, mdb, etc. were very helpful here to see what's going on. The result of a couple of hours of fun is below.

root@global # touch /test/file1
root@global # rm -f /test/file1
root@global # ./mwac -b "/test/file1"
MWAC black list for the global zone installed.

root@global # touch /test/file1
touch: cannot create /test/file1: Read-only file system
root@global # touch /test/file2 ; rm /test/file2
root@global # 
Now let's disable MWAC again:
root@global # mwac -u
MWAC unlock succeeded.

root@global # touch /test/file1 ; rm /test/file1
root@global # 
You can even use patterns:
root@global # mwac -b "/test/*"
MWAC black list for the global zone installed.

root@global # touch /test/a ; mkdir /test/b
touch: cannot create /test/a: Read-only file system
mkdir: Failed to make directory "/test/b"; Read-only file system
root@global # 

Thursday, January 19, 2012

ReFS

Next generation file system for Windows: ReFS
It looks pretty interesting and promising. Something like ZFS lite for Windows.

"The key goals of ReFS are:
  • Maintain a high degree of compatibility with a subset of NTFS features that are widely adopted while deprecating others that provide limited value at the cost of system complexity and footprint.
  • Verify and auto-correct data. Data can get corrupted due to a number of reasons and therefore must be verified and, when possible, corrected automatically. Metadata must not be written in place to avoid the possibility of “torn writes,” which we will talk about in more detail below.
  • Optimize for extreme scale. Use scalable structures for everything. Don’t assume that disk-checking algorithms, in particular, can scale to the size of the entire file system.
  • Never take the file system offline. Assume that in the event of corruptions, it is advantageous to isolate the fault while allowing access to the rest of the volume. This is done while salvaging the maximum amount of data possible, all done live.
  • Provide a full end-to-end resiliency architecture when used in conjunction with the Storage Spaces feature, which was co-designed and built in conjunction with ReFS.
The key features of ReFS are as follows (note that some of these features are provided in conjunction with Storage Spaces).
  • Metadata integrity with checksums
  • Integrity streams providing optional user data integrity
  • Allocate on write transactional model for robust disk updates (also known as copy on write)
  • Large volume, file and directory sizes
  • Storage pooling and virtualization makes file system creation and management easy
  • Data striping for performance (bandwidth can be managed) and redundancy for fault tolerance
  • Disk scrubbing for protection against latent disk errors
  • Resiliency to corruptions with "salvage" for maximum volume availability in all cases
  • Shared storage pools across machines for additional failure tolerance and load balancing
"

Wednesday, January 04, 2012

What is Watson?

You have probably heard of Watson from IBM. Michael Perrone gave a very entertaining presentation on Watson at LISA. Enjoy.