Friday, November 09, 2012
vmtasks explained
Solaris 11 introduced a new kernel process called "vmtasks" which accelerates some operations when working with shared memory. For more details see here.
Tuesday, October 23, 2012
Running OpenAFS on Solaris 11 x86 / ZFS
Recently I gave a talk on running OpenAFS services on top of Solaris 11 x86 / ZFS. The talk was split into two parts. The first part is about the $$ benefits of transparent ZFS compression when running on 3rd-party x86 hardware (it also makes sense on Sun/Oracle kit - in some cases even more so). This part also discusses some ideas about running AFS on internal disks instead of directly attached disk arrays, which, again thanks to ZFS built-in compression, is worthwhile and delivers even more $$ savings.
The main message of this part is that if your data compresses well (above 2x), running OpenAFS on ZFS can deliver similar or even better performance, but most importantly it can save you lots of $$, both in acquisition costs and in the cost of running the AFS plant. In most cases you should even be able to re-use your current x86 hardware. The beauty of AFS is that we were able to migrate data from Linux to Solaris/ZFS in-place, re-using the same x86 HW, and all of this was completely transparent to all clients (keep in mind we are talking about PBs of data) - this is truly the cloud file system. I think OpenAFS is one of the under-appreciated technologies in the market.
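For reference, turning compression on for a dataset and checking the achieved ratio are both one-liners on ZFS (the pool/dataset name below is made up):

# zfs set compression=on mypool/afs
# zfs get compressratio mypool/afs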
The second part is about using DTrace, both on dev and production systems, to find scalability and performance bottlenecks, as well as other bugs. Two easy, real-life examples are discussed which resulted in considerable improvements in the scalability and performance of some OpenAFS operations, along with some other examples of D scripts which provide top-like output with statistics (slide #32 is an example from a Solaris NFS server serving VMware clients, displaying different stats per VM from a single file system...). DTrace has proven to be a very powerful and helpful tool for us, although it is hard to put a specific $ value on what it brings.
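As a taste of the kind of one-liners involved, here is a generic DTrace sketch that samples on-CPU user stacks to find hot code paths in the AFS file server (the process name "fileserver" is an assumption here - the actual scripts are in the slides):

# dtrace -n 'profile-997 /execname == "fileserver"/ { @[ustack()] = count(); }'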
The slides should be available here.
Wednesday, August 29, 2012
OpenIndiana is dead
With the main guy behind the project resigning, OI is essentially dead. It has been dead for some time, though, and with no commercial backing it never really had much chance. This is sad news indeed (although I haven't really used OI). It marks the end of the OpenSolaris era.
Can Illumos survive in the long term? Can it become relevant outside of a couple of niche use cases?
Ironically, it is Oracle's Solaris which will probably outlive all of them.
Friday, July 27, 2012
Locking a running process in RAM
Recently I was looking at the possibility of locking some or all memory mappings of an already running process in RAM. Why would you want to do that? There might be many reasons; one of them is to prevent a critical process from being swapped out.
Now, how can we do it without restarting a process or changing its code? There is memcntl() and similar calls like plock(), mlock(), etc. - but they only work for the process calling them. However, libproc on Solaris allows you to run memcntl() in the context of another process, among many other cool things. By the way, libproc is used by tools like truss, ppgsz, etc.
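For comparison, here is a minimal sketch of what in-process locking looks like - a process locking all of its own current and future mappings with mlockall(3C) (it needs the proc_lock_memory privilege, as we will see below):

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
        /* Lock all current and future mappings of this process in RAM. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
                exit(1);
        }
        /* ... critical work, now guaranteed to stay resident ... */
        exit(0);
}

The catch, again, is that a call like this has to be made by the process itself - which is exactly the limitation libproc lets us work around.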
First, let's see that it actually works.
Here is a bash process running as a non-root user; notice that no mappings are locked in RAM.
cwafseng3 $ pmap -ax $$
27709:  bash
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08050000     968     968       -       - r-x--  bash
08151000      76      76       8       - rwx--  bash
08164000     140     140      56       - rwx--    [ heap ]
F05D0000     440     440       -       - r-x--  libnsl.so.1
F064E000       8       8       4       - rw---  libnsl.so.1
F0650000      20      12       -       - rw---  libnsl.so.1
F0660000      56      56       -       - r-x--  libsocket.so.1
F067E000       4       4       -       - rw---  libsocket.so.1
FE560000      64      16       -       - rwx--    [ anon ]
FE577000       4       4       4       - rwxs-    [ anon ]
FE580000      24      12       4       - rwx--    [ anon ]
FE590000       4       4       4       - rw---    [ anon ]
FE5A0000    1352    1352       -       - r-x--  libc_hwcap1.so.1
FE702000      44      44      16       - rwx--  libc_hwcap1.so.1
FE70D000       4       4       -       - rwx--  libc_hwcap1.so.1
FE710000       4       4       -       - r-x--  libdl.so.1
FE720000       4       4       4       - rw---    [ anon ]
FE730000     184     184       -       - r-x--  libcurses.so.1
FE76E000      16      16       -       - rw---  libcurses.so.1
FE772000       8       8       -       - rw---  libcurses.so.1
FE780000       4       4       4       - rw---    [ anon ]
FE790000       4       4       4       - rw---    [ anon ]
FE7A0000       4       4       -       - rw---    [ anon ]
FE7AD000       4       4       -       - r--s-    [ anon ]
FE7B4000     220     220       -       - r-x--  ld.so.1
FE7FB000       8       8       4       - rwx--  ld.so.1
FE7FD000       4       4       -       - rwx--  ld.so.1
FEFFB000      16      16       4       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    3688    3620     116       -
Now I will use the small tool I wrote to lock all mappings with RX or RWX protections. The tool requires a PID to be specified.
$ ./pr_memcntl 27709
pr_memcntl() failed: Not owner
Although I ran it as root, it failed. Remember when I wrote a moment ago that libproc calls memcntl() in the context of the target process? The target process here is bash with PID 27709, which is running as a standard user, and by default a standard user cannot lock pages in memory. We can add the required privilege for locking pages in RAM. Let's see what's missing.
I enabled privilege debugging for the bash process and ran the pr_memcntl tool again:
$ ppriv -D 27709
bash[27709]: missing privilege "proc_lock_memory" (euid = 145104, syscall = 131) needed at memcntl+0x140
Now let's add the missing privilege to the bash process and try locking the mappings again:
$ ppriv -s EP+proc_lock_memory 27709
$ ./pr_memcntl 27709
$

This time there was no error. Let's look at the pmap output again:
$ pmap -ax $$
27709:  bash
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08050000     968     968       -     968 r-x--  bash
08151000      76      76       8      76 rwx--  bash
08164000     140     140      48     140 rwx--    [ heap ]
F05D0000     440     440       -     440 r-x--  libnsl.so.1
F064E000       8       8       4       - rw---  libnsl.so.1
F0650000      20      12       -       - rw---  libnsl.so.1
F0660000      56      56       -      56 r-x--  libsocket.so.1
F067E000       4       4       -       - rw---  libsocket.so.1
FE560000      64      64       -      64 rwx--    [ anon ]
FE577000       4       4       4       - rwxs-    [ anon ]
FE580000      24      24       4      24 rwx--    [ anon ]
FE590000       4       4       4       - rw---    [ anon ]
FE5A0000    1352    1352       -    1352 r-x--  libc_hwcap1.so.1
FE702000      44      44      16      44 rwx--  libc_hwcap1.so.1
FE70D000       4       4       -       4 rwx--  libc_hwcap1.so.1
FE710000       4       4       -       4 r-x--  libdl.so.1
FE720000       4       4       4       - rw---    [ anon ]
FE730000     184     184       -     184 r-x--  libcurses.so.1
FE76E000      16      16       -       - rw---  libcurses.so.1
FE772000       8       8       -       - rw---  libcurses.so.1
FE780000       4       4       4       - rw---    [ anon ]
FE790000       4       4       4       - rw---    [ anon ]
FE7A0000       4       4       -       - rw---    [ anon ]
FE7AD000       4       4       -       - r--s-    [ anon ]
FE7B4000     220     220       -     220 r-x--  ld.so.1
FE7FB000       8       8       4       8 rwx--  ld.so.1
FE7FD000       4       4       -       4 rwx--  ld.so.1
FEFFB000      16      16       4       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    3688    3680     108    3588
It works! :)
Notice that only mappings with RX or RWX protections are locked (as hard-coded in the tool; any mapping, or the entire process, could be locked if desired).
The tool could obviously be extended to add the privilege automatically if needed.
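A quick-and-dirty sketch of how that could look - just shelling out to ppriv(1) from the tool (a proper version would manipulate the target's privilege sets via /proc, as ppriv itself does; the function name is made up):

#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: grant proc_lock_memory to the target PID via ppriv(1). */
static int
grant_lock_priv(pid_t pid)
{
        char cmd[64];

        (void) snprintf(cmd, sizeof (cmd),
            "ppriv -s EP+proc_lock_memory %d", (int)pid);
        return (system(cmd));
}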
The C code below is a prototype - it is not idiot-proof, nor does it handle all errors, and it could be more user-friendly - but it works and is trivial to extend. It should work on Solaris 10 and Solaris 11 as well as on all Illumos-based distributions.
Notice that it sets the MCL_CURRENT|MCL_FUTURE flags, meaning that not only are the current rx/rwx mappings locked, but all future mappings with rx/rwx protections will be locked as well. In order to compile the program you need libproc.h, which is currently not distributed with Solaris (hopefully that will change soon). You can get a copy from here.
Some ideas on how the tool could easily be extended:
- add an option to add the proc_lock_memory privilege automatically (and perhaps update the resource limit as well)
- add an option to remove the lock from a specific mapping or from all mappings (a sketch of this follows the code below)
- add an option to specify what should be locked
- add options to specify whether only current mappings, only future mappings, or both should be locked
PS. Putting a process in the RT scheduling class would achieve a similar result, although it wouldn't give you control over which mappings are locked, and running in RT might not be desirable for other reasons.
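For reference, moving an already running process into the RT class is a one-liner (using the PID from the example above):

# priocntl -s -c RT -i pid 27709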
// gcc -m64 -lproc -I. -o pr_memcntl pr_memcntl.c

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <libproc.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        int perr;
        static struct ps_prochandle *Pr;

        if (argc != 2) {
                (void) fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
                exit(1);
        }
        pid = atoi(argv[1]);

        /* Attach to the target process without stopping it. */
        if ((Pr = Pgrab(pid, PGRAB_NOSTOP, &perr)) == NULL) {
                printf("Pgrab() failed: %s\n", Pgrab_error(perr));
                exit(1);
        }

        /* Lock current and future rx (text) mappings in RAM. */
        if (pr_memcntl(Pr, 0, 0, MC_LOCKAS,
            (caddr_t)(MCL_CURRENT|MCL_FUTURE), PROC_TEXT, (int)0)) {
                perror("pr_memcntl() failed");
                Prelease(Pr, 0);
                exit(1);
        }

        /* Lock current and future rwx mappings as well. */
        if (pr_memcntl(Pr, 0, 0, MC_LOCKAS,
            (caddr_t)(MCL_CURRENT|MCL_FUTURE), PROC_TEXT|PROT_WRITE, (int)0)) {
                perror("pr_memcntl() failed");
                Prelease(Pr, 0);
                exit(1);
        }

        Prelease(Pr, 0);
        Pr = NULL;
        exit(0);
}
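As an example of the unlock extension mentioned in the list above, a hedged sketch of the extra call - reusing the same Pr handle as in the code above, with MC_UNLOCKAS removing the locks from all of the target's mappings:

/* Sketch: remove the locks from all of the target's mappings again. */
if (pr_memcntl(Pr, 0, 0, MC_UNLOCKAS, (caddr_t)0, (int)0, (int)0)) {
        perror("pr_memcntl(MC_UNLOCKAS) failed");
}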
Wednesday, May 02, 2012
Physical disk locations
zpool(1M) has a very handy option to display physical disk locations on some hardware.
cwafseng3 $ zpool status -l cwafseng3-0
  pool: cwafseng3-0
 state: ONLINE
  scan: scrub canceled on Thu Apr 12 13:52:13 2012
config:

        NAME                                       STATE     READ WRITE CKSUM
        cwafseng3-0                                ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD02/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD23/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD22/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD21/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD20/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD19/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD17/disk  ONLINE       0     0     0
            /dev/chassis/i86pc.unknown/HDD15/disk  ONLINE       0     0     0

errors: No known data errors
cwafseng3 $

The HDDxx entries directly correspond to the physical slots, in this case in a Sun x4270 M2 server. Compare it to the standard output:
cwafseng3 $ zpool status cwafseng3-0
  pool: cwafseng3-0
 state: ONLINE
  scan: scrub canceled on Thu Apr 12 13:52:13 2012
config:

        NAME                       STATE     READ WRITE CKSUM
        cwafseng3-0                ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c3t5000CCA00AC87F55d0  ONLINE       0     0     0
            c3t5000CCA00AAA0D1Dd0  ONLINE       0     0     0
            c3t5000CCA00AA95559d0  ONLINE       0     0     0
            c3t5000CCA00AAAD155d0  ONLINE       0     0     0
            c3t5000CCA015214845d0  ONLINE       0     0     0
            c3t5000CCA015214F85d0  ONLINE       0     0     0
            c3t5000CCA01521070Dd0  ONLINE       0     0     0
            c3t5000CCA0151A287Dd0  ONLINE       0     0     0

errors: No known data errors
cwafseng3 $

See also croinfo(1M) for how to get this information (and more) for all disk drives, regardless of whether they are part of a zpool or not.
Friday, March 23, 2012
Illumos: Google Summer of Code
Are you a student and would like to do some interesting coding during the summer? If yes, then check this.
Friday, January 20, 2012
MWAC in Global Zone
Solaris 11 has a cool new feature called Immutable Zones. Darren Moffat presented the new features in Solaris 11 Zones at the last LOSUG meeting in London. Immutable Zones basically allow read-only or partially read-only zones to be deployed. You can even combine it with ZFS encryption - see Darren's blog entry for more details. The underlying technology behind Immutable Zones is called Mandatory Write Access Control (MWAC) and is implemented in the kernel. For each open, unlink, etc. syscall, the VFS layer checks whether MWAC is enabled for a given filesystem and zone, and if it is, it checks the white and black lists associated with the zone and potentially denies write access to a file (generating EROFS). The actual definitions for the different default profiles are located in the /usr/lib/brand/solaris/config.xml file. It is *very* simple to use the pre-defined profiles when creating a zone and it just works. Really cool. Thanks Darren for the great demo.
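For example, making a non-global zone immutable with one of the pre-defined profiles is a single zonecfg property (the zone name here is made up):

# zonecfg -z myzone
zonecfg:myzone> set file-mac-profile=fixed-configuration
zonecfg:myzone> commit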
Now, MWAC only works with non-global zones... at least by default. There is no public interface exposed to manipulate MWAC rules directly or to enable it for the global zone, but that doesn't mean one can't try to do it anyway. DTrace, objdump, mdb, etc. were very helpful here to see what's going on. The result of a couple of hours of fun is below.
root@global # touch /test/file1
root@global # rm -f /test/file1
root@global # ./mwac -b "/test/file1"
MWAC black list for the global zone installed.
root@global # touch /test/file1
touch: cannot create /test/file1: Read-only file system
root@global # touch /test/file2 ; rm /test/file2
root@global #

Now let's disable MWAC again:
root@global # mwac -u
MWAC unlock succeeded.
root@global # touch /test/file1 ; rm /test/file1
root@global #

You can even use patterns:
root@global # mwac -b "/test/*"
MWAC black list for the global zone installed.
root@global # touch /test/a ; mkdir /test/b
touch: cannot create /test/a: Read-only file system
mkdir: Failed to make directory "/test/b"; Read-only file system
root@global #
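Notice that MWAC denials surface as plain "Read-only file system" (EROFS) errors. A rough DTrace sketch to spot such denials system-wide - it counts open(2) variants failing with errno 30 (EROFS), so it will also catch failures on genuinely read-only file systems:

# dtrace -n 'syscall::open*:return /(int)arg0 == -1 && errno == 30/ { @[execname] = count(); }'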
Thursday, January 19, 2012
ReFS
Next generation file system for Windows: ReFS
It looks pretty interesting and promising. Something like ZFS lite for Windows.
"The key goals of ReFS are:
The key features of ReFS are as follows (note that some of these features are provided in conjunction with Storage Spaces).
- Maintain a high degree of compatibility with a subset of NTFS features that are widely adopted while deprecating others that provide limited value at the cost of system complexity and footprint.
- Verify and auto-correct data. Data can get corrupted due to a number of reasons and therefore must be verified and, when possible, corrected automatically. Metadata must not be written in place to avoid the possibility of “torn writes,” which we will talk about in more detail below.
- Optimize for extreme scale. Use scalable structures for everything. Don’t assume that disk-checking algorithms, in particular, can scale to the size of the entire file system.
- Never take the file system offline. Assume that in the event of corruptions, it is advantageous to isolate the fault while allowing access to the rest of the volume. This is done while salvaging the maximum amount of data possible, all done live.
- Provide a full end-to-end resiliency architecture when used in conjunction with the Storage Spaces feature, which was co-designed and built in conjunction with ReFS.
- Metadata integrity with checksums
- Integrity streams providing optional user data integrity
- Allocate on write transactional model for robust disk updates (also known as copy on write)
- Large volume, file and directory sizes
- Storage pooling and virtualization makes file system creation and management easy
- Data striping for performance (bandwidth can be managed) and redundancy for fault tolerance
- Disk scrubbing for protection against latent disk errors
- Resiliency to corruptions with "salvage" for maximum volume availability in all cases
- Shared storage pools across machines for additional failure tolerance and load balancing
"
Wednesday, January 04, 2012
What is Watson?
You have probably heard about IBM's Watson. Michael Perrone gave a very entertaining presentation on Watson at LISA. Enjoy.