Friday, June 12, 2015

Monday, March 23, 2015

Physical Locations of PCI SSDs

The latest Solaris 11 SRU has a new feature - it can identify the physical locations of F40 and F80 PCI SSD cards - it registers them under the Topology Framework.

Here is an example of diskinfo output on an x4-2l server with 24 SSDs in front presented as JBOD, 2x SSDs in the rear mirrored by a RAID controller (for the OS), and 4x PCI F80 cards (each card presents 4 LUNs):

$ diskinfo
D:devchassis-path                        c:occupant-compdev
---------------------------------------  ---------------------
/dev/chassis/SYS/HDD00/disk              c0t55CD2E404B64A3E9d0
/dev/chassis/SYS/HDD01/disk              c0t55CD2E404B64B1ABd0
/dev/chassis/SYS/HDD02/disk              c0t55CD2E404B64B1BDd0
/dev/chassis/SYS/HDD03/disk              c0t55CD2E404B649E02d0
/dev/chassis/SYS/HDD04/disk              c0t55CD2E404B64A33Ed0
/dev/chassis/SYS/HDD05/disk              c0t55CD2E404B649DB5d0
/dev/chassis/SYS/HDD06/disk              c0t55CD2E404B649DBCd0
/dev/chassis/SYS/HDD07/disk              c0t55CD2E404B64AB2Fd0
/dev/chassis/SYS/HDD08/disk              c0t55CD2E404B64AC96d0
/dev/chassis/SYS/HDD09/disk              c0t55CD2E404B64A580d0
/dev/chassis/SYS/HDD10/disk              c0t55CD2E404B64ACC5d0
/dev/chassis/SYS/HDD11/disk              c0t55CD2E404B64B1DAd0
/dev/chassis/SYS/HDD12/disk              c0t55CD2E404B64ACF1d0
/dev/chassis/SYS/HDD13/disk              c0t55CD2E404B649EE1d0
/dev/chassis/SYS/HDD14/disk              c0t55CD2E404B64A581d0
/dev/chassis/SYS/HDD15/disk              c0t55CD2E404B64AB9Cd0
/dev/chassis/SYS/HDD16/disk              c0t55CD2E404B649DCAd0
/dev/chassis/SYS/HDD17/disk              c0t55CD2E404B6499CBd0
/dev/chassis/SYS/HDD18/disk              c0t55CD2E404B64AC98d0
/dev/chassis/SYS/HDD19/disk              c0t55CD2E404B6499B7d0
/dev/chassis/SYS/HDD20/disk              c0t55CD2E404B64AB05d0
/dev/chassis/SYS/HDD21/disk              c0t55CD2E404B64A33Fd0
/dev/chassis/SYS/HDD22/disk              c0t55CD2E404B64AB1Cd0
/dev/chassis/SYS/HDD23/disk              c0t55CD2E404B64A3CFd0
/dev/chassis/SYS/HDD24                   -
/dev/chassis/SYS/HDD25                   -
/dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk  c0t5002361000260451d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk  c0t5002361000258611d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN2/disk  c0t5002361000259912d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk  c0t5002361000259352d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN0/disk  c0t5002361000262937d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN1/disk  c0t5002361000262571d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN2/disk  c0t5002361000262564d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN3/disk  c0t5002361000262071d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk  c0t5002361000125858d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk  c0t5002361000125874d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN2/disk  c0t5002361000194066d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN3/disk  c0t5002361000142889d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN0/disk  c0t5002361000371137d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN1/disk  c0t5002361000371435d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN2/disk  c0t5002361000371821d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN3/disk  c0t5002361000371721d0
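
If you want to consume this listing from a script, here is a minimal sketch in Python that turns diskinfo output into a mapping from chassis path to device name. The function name parse_diskinfo and the shortened sample are my own illustration; it assumes the two-column layout shown above, with "-" marking an empty slot.

```python
def parse_diskinfo(text):
    """Map each /dev/chassis path to its device name (None for empty slots)."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines.
        if len(fields) != 2 or not fields[0].startswith("/dev/chassis/"):
            continue
        path, device = fields
        mapping[path] = None if device == "-" else device
    return mapping


# Shortened sample taken from the listing above.
sample = """\
D:devchassis-path                        c:occupant-compdev
---------------------------------------  ---------------------
/dev/chassis/SYS/HDD00/disk              c0t55CD2E404B64A3E9d0
/dev/chassis/SYS/HDD24                   -
/dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk  c0t5002361000260451d0
"""

disks = parse_diskinfo(sample)
for path, device in sorted(disks.items()):
    print("%s -> %s" % (path, device or "(empty)"))
```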

Let's create a ZFS pool on top of the F80s and see zpool status output:
(you can use the SYS/MB/... names when creating a pool as well)

$ zpool status -l XXXXXXXXXXXXXXXXXXXX-1
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Mar 21 11:31:01 2015

        NAME                                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXX-1                       ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk  ONLINE       0     0     0
          mirror-1                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk  ONLINE       0     0     0
          mirror-2                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN3/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN2/disk  ONLINE       0     0     0
          mirror-3                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN2/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk  ONLINE       0     0     0
          mirror-4                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN3/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN0/disk  ONLINE       0     0     0
          mirror-5                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN2/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN1/disk  ONLINE       0     0     0
          mirror-6                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN1/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN3/disk  ONLINE       0     0     0
          mirror-7                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN0/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN2/disk  ONLINE       0     0     0

errors: No known data errors

It also means that all FMA alerts should include the physical path as well, which should make it much easier to identify a given F80/LUN if something goes wrong.
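
A pool layout like the one above, where every mirror spans two different cards, could be assembled with a small helper. This is only a sketch under a few assumptions: the pool name "tank" is hypothetical, the full /dev/chassis/... paths are assumed to be accepted by zpool create, and for simplicity it pairs same-numbered LUNs, whereas the pool shown above pairs them in a different order. It only builds the argument list and does not run anything.

```python
def mirrored_pool_command(pool, card_pairs, luns_per_card=4):
    """Build a `zpool create` argument list in which each mirror pairs a LUN
    on one card with the same-numbered LUN on its partner card."""
    argv = ["zpool", "create", pool]
    for card_a, card_b in card_pairs:
        for lun in range(luns_per_card):
            argv += [
                "mirror",
                "/dev/chassis/SYS/MB/%s/F80/LUN%d/disk" % (card_a, lun),
                "/dev/chassis/SYS/MB/%s/F80/LUN%d/disk" % (card_b, lun),
            ]
    return argv


# Card pairing taken from the zpool status output; "tank" is a made-up name.
cmd = mirrored_pool_command("tank", [("PCIE4", "PCIE1"), ("PCIE2", "PCIE5")])
print(" ".join(cmd))
```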

Saturday, March 21, 2015

ZFS: Persistent L2ARC

Solaris SRU delivers persistent L2ARC. What is interesting about it is that it stores raw ZFS blocks, so if you enabled compression then L2ARC will also store compressed blocks (so it can store more data). Similarly with encryption.

Friday, March 20, 2015

Managing Solaris with RAD

Solaris 11 provides "The Remote Administration Daemon, commonly referred to by its acronym and command name, rad, is a standard system service that offers secure, remote administrative access to an Oracle Solaris system."

RAD is essentially an API to programmatically manage and query different Solaris subsystems like networking, zones, kstat, smf, etc.

Let's see an example of how to use it to list all zones configured on a local system.

# cat

import rad.client as radcli
import rad.connect as radcon
import as zbind

with radcon.connect_unix() as rc:
    zones = rc.list_objects(zbind.Zone())
    for i in range(0, len(zones)):
        zone = rc.get_object(zones[i])
        print "zone: %s (%s)" % (zone.name, zone.state)
        for prop in zone.getResourceProperties(zbind.Resource('global')):
            # the zone name is already printed above
            if prop.name == 'zonename':
                continue
            print "\t%-20s : %s" % (prop.name, prop.value)

# ./
zone: kz1 (configured)
        zonepath:           :
        brand               : solarisk-kz
        autoboot            : false
        autoshutdown        : shutdown
        bootargs            :
        file-mac-profile    :
        pool                :
        scheduling-class    :
        ip-type             : exclusive
        hostid              : 0x44497532
        tenant              :
zone: kz2 (installed)
        zonepath:           : /system/zones/%{zonename}
        brand               : solarisk-kz
        autoboot            : false
        autoshutdown        : shutdown
        bootargs            :
        file-mac-profile    :
        pool                :
        scheduling-class    :
        ip-type             : exclusive
        hostid              : 0x41d45bb
        tenant              :

Or another example to show how to create a new Kernel Zone with autoboot property set to true:


import sys

import rad.client
import rad.connect
import as zonemgr

class SolarisZoneManager:
    def __init__(self):
        self.description = "Solaris Zone Manager"

    def init_rad(self):
        try:
            self.rad_instance = rad.connect.connect_unix()
        except Exception as reason:
            print "Cannot connect to RAD: %s" % (reason)

    def get_zone_by_name(self, name):
        try:
            pat = rad.client.ADRGlobPattern({'name' : name})
            zone = self.rad_instance.get_object(zonemgr.Zone(), pat)
        except rad.client.NotFoundError:
            return None
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return None

        return zone

    def zone_get_resource_prop(self, zone, resource, prop, filter=None):
        try:
            val = zone.getResourceProperties(zonemgr.Resource(resource, filter), [prop])
        except rad.client.ObjectError:
            return None
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return None

        return val[0].value if val else None

    def zone_set_resource_prop(self, zone, resource, prop, val):
        current_val = self.zone_get_resource_prop(zone, resource, prop)
        if current_val is not None and current_val == val:
            # the val is already set
            return 0

        try:
            if current_val is None:
                zone.addResource(zonemgr.Resource(resource, [zonemgr.Property(prop, val)]))
            else:
                zone.setResourceProperties(zonemgr.Resource(resource), [zonemgr.Property(prop, val)])
        except rad.client.ObjectError as err:
            print "Failed to set %s property on %s resource for zone %s: %s" % (prop, resource, zone.name, err)
            return 0

        return 1

    def zone_create(self, name, template):
        zonemanager = self.rad_instance.get_object(zonemgr.ZoneManager())
        zonemanager.create(name, None, template)
        zone = self.get_zone_by_name(name)
        try:
            self.zone_set_resource_prop(zone, 'global', 'autoboot', 'true')
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return 0
        return 1

x = SolarisZoneManager()
# connect to the local RAD instance before using it
x.init_rad()
if x.zone_create(str(sys.argv[1]), 'SYSsolaris-kz'):
    print "Zone created successfully."

There are many simple examples in the zonemgr.3rad man page, and what I found very useful is to look at solariszones/ from OpenStack. It is actually very interesting that OpenStack is using RAD on Solaris.

RAD is very powerful, and with more modules constantly being added it is becoming a comprehensive programmatic API to remotely manage Solaris systems. It is also very useful if you are writing components of a configuration management system for Solaris.

What's the most anticipated RAD module currently missing in stable Solaris? I think it is the ZFS module...

Friday, February 27, 2015

Specifying Physical Disk Locations for AI Installations

One of the very useful features in Solaris is the ability to identify physical disk location on supported hardware (mainly Oracle x86 and SPARC servers). This not only makes it easier to identify a faulty disk to be replaced but also makes OS installation more robust, as you can actually specify physical disk locations in a given server model where OS should be installed. Here is an example output from diskinfo tool on x5-2l server: 

$ diskinfo
D:devchassis-path            c:occupant-compdev
---------------------------  ---------------------
/dev/chassis/SYS/HDD00/disk  c0t5000CCA01D3A1A24d0
/dev/chassis/SYS/HDD01/disk  c0t5000CCA01D2EB40Cd0
/dev/chassis/SYS/HDD02/disk  c0t5000CCA01D30FD90d0
/dev/chassis/SYS/HDD03/disk  c0t5000CCA032018CB4d0
/dev/chassis/SYS/RHDD0/disk  c0t5000CCA01D34EB38d0
/dev/chassis/SYS/RHDD1/disk  c0t5000CCA01D315288d0

The server supports 24x disks in front and another two disks in the back. We use the front disks for data and the two disks in the back for the OS. In the past we used a RAID controller to mirror the two OS disks, while all disks in the front were presented in pass-thru mode (JBOD) and managed by ZFS.

Recently I started looking into using ZFS for mirroring the OS disks as well. Notice in the above output that the two disks in the back of the x5-2l server are identified as SYS/RHDD0 and SYS/RHDD1.

This is very useful, as with SAS the CTD would be different for each disk and would also change if a disk was replaced, while the SYS/[R]HDDn location would always stay the same.

See also my older blog entry on how this information is presented in other subsystems (FMA or ZFS).

Below is a part of AI manifest which defines that OS should be installed on the two rear disks and mirrored by ZFS:
      <disk in_vdev="mirror" in_zpool="rpool" whole_disk="true">
        <disk_name name="SYS/RHDD0" name_type="receptacle"/>
      </disk>
      <disk in_vdev="mirror" in_zpool="rpool" whole_disk="true">
        <disk_name name="SYS/RHDD1" name_type="receptacle"/>
      </disk>
      <logical>
        <zpool is_root="true" name="rpool">
          <vdev name="mirror" redundancy="mirror"/>
        </zpool>
      </logical>

In our environment the AI manifest is generated per server by a configuration management system, based on a host profile. This means that for x5-2l servers we generate the AI manifest as shown above, but on some other servers we want the OS to be installed on a RAID volume, and on a general server which doesn't fall into any specific category we install the OS on boot_disk. So depending on the server we generate different sections in the AI manifest. This is similar to derived manifests in AI, but instead of being separate from the configuration management system, in our case it is part of it.
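
To give an idea of what generating such a section from a host profile might look like, here is a minimal Python 3 sketch using the standard xml.etree.ElementTree module. It is not our actual generator: the PROFILES table and the function name are made up for illustration, and the enclosing target and logical elements are assumed from the usual AI manifest layout.

```python
import xml.etree.ElementTree as ET


def rpool_mirror_target(receptacles):
    """Return a <target> element that mirrors rpool across the given
    receptacle names (for example "SYS/RHDD0")."""
    target = ET.Element("target")
    for name in receptacles:
        disk = ET.SubElement(target, "disk", in_vdev="mirror",
                             in_zpool="rpool", whole_disk="true")
        ET.SubElement(disk, "disk_name", name=name, name_type="receptacle")
    logical = ET.SubElement(target, "logical")
    zpool = ET.SubElement(logical, "zpool", is_root="true", name="rpool")
    ET.SubElement(zpool, "vdev", name="mirror", redundancy="mirror")
    return target


# Hypothetical host profile table: server model -> rear OS disk receptacles.
PROFILES = {"x5-2l": ["SYS/RHDD0", "SYS/RHDD1"]}

fragment = ET.tostring(rpool_mirror_target(PROFILES["x5-2l"]), encoding="unicode")
print(fragment)
```

Other server classes (a RAID volume, or plain boot_disk) would get their own small generator functions selected by the same profile lookup.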

Wednesday, February 04, 2015

Native IPS Manifests

We used to use the pkgbuild tool to generate IPS packages. However, I recently started working on an internal Solaris SPARC build, and we decided to use IPS fat packages for the x86 and SPARC platforms, similar to how Oracle delivers Solaris itself. We could keep using pkgbuild, but because it always tags a package with the variant of the host it was executed on, we would have to run it once on an x86 server and once on a SPARC server, each time publishing to a separate repository, and then use pkgmerge to create a fat package and publish it into a third repository.

Since we have all our binaries already compiled for all platforms, when we build a package (RPM, IPS, etc.) all we have to do is to pick up proper files, add metadata and publish a package. No point in having three repositories and at least two hosts involved in publishing a package.

In our case native IPS manifest is a better (simpler) way to do it - we can publish a fat package from a single server to its final repository in a single step.

What is also useful is that pkgmogrify transformations can be listed in the same manifest file. The entire file is loaded first, then the transformations are run in the specified order, and the new manifest is printed to stdout. This means that in most cases we can have a single file for each package we want to generate, similarly to pkgbuild. There are cases where there are lots of files, and there we use pkgsend generate to list all files and directories and keep a separate file with metadata and transformations. In this case pkgbuild is a little easier to understand than the native IPS tooling, but it actually is not that bad.

Let's see an example IPS manifest, with some basic transformations and with both x86 and SPARC binaries.

set name=pkg.fmri value=pkg://ms/ms/pam/access@$(PKG_VERSION).$(PKG_RELEASE),5.11-0
set name=pkg.summary value="PAM pam_access library"
set name=pkg.description value="PAM pam_access module. Compiled from Linux-PAM-1.1.6."
set name=info.classification value=""
set name=info.maintainer value="Robert Milkowski "

set name=variant.arch value=i386 value=sparc

depend type=require fmri=ms/pam/libpam@$(PKG_VERSION).$(PKG_RELEASE)

dir group=sys mode=0755 owner=root path=usr
dir group=bin mode=0755 owner=root path=usr/lib
dir group=bin mode=0755 owner=root path=usr/lib/security
dir group=bin mode=0755 owner=root path=usr/lib/security/amd64      variant.arch=i386
dir group=bin mode=0755 owner=root path=usr/lib/security/sparcv9    variant.arch=sparc

<transform file -> default mode 0555>
<transform file -> default group bin>
<transform file -> default owner root>

# i386
file SOURCES/Linux-PAM/libs/intel/32/    path=usr/lib/security/          variant.arch=i386
file SOURCES/Linux-PAM/libs/intel/64/    path=usr/lib/security/amd64/    variant.arch=i386

# sparc
file SOURCES/Linux-PAM/libs/sparc/32/    path=usr/lib/security/          variant.arch=sparc
file SOURCES/Linux-PAM/libs/sparc/64/    path=usr/lib/security/sparcv9/  variant.arch=sparc
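
To illustrate what the variant.arch tags mean in a fat package, here is a rough Python sketch that selects the actions that would apply on a given architecture: actions tagged with a different variant.arch are skipped, untagged actions apply everywhere. This is only an illustration with very naive parsing (whitespace-separated key=value tokens), not how the pkg client is implemented, and the function name is my own.

```python
def actions_for_arch(manifest_lines, arch):
    """Return the manifest actions that apply on the given architecture."""
    selected = []
    for line in manifest_lines:
        line = line.strip()
        # Skip blanks, comments, and <transform ...> directives.
        if not line or line.startswith("#") or line.startswith("<"):
            continue
        tokens = line.split()
        attrs = dict(t.split("=", 1) for t in tokens[1:] if "=" in t)
        # Untagged actions have no variant.arch attribute and apply to all.
        if attrs.get("variant.arch", arch) == arch:
            selected.append(line)
    return selected


lines = [
    "dir group=sys mode=0755 owner=root path=usr",
    "dir group=bin mode=0755 owner=root path=usr/lib/security/amd64      variant.arch=i386",
    "dir group=bin mode=0755 owner=root path=usr/lib/security/sparcv9    variant.arch=sparc",
]
sparc_actions = actions_for_arch(lines, "sparc")
for action in sparc_actions:
    print(action)
```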

We can then publish the manifest by running:
$ pkgmogrify -D PKG_VERSION=1.1.6 -D PKG_RELEASE=1 SPECS/ms-pam-access.manifest | \
    pkgsend publish -s /path/to/IPS/repo
This would really go into a Makefile so in order to publish a package one does something like:
$ PUBLISH_REPO=file:///xxxxx/ gmake publish-ms-pam-access
In case there are too many files to list manually in the manifest, you can use pkgsend generate to generate a full list of files and directories. You need to create a manifest with only the package metadata and all transformations (which put files in their proper locations, set the desired owner, group, etc.). In order to publish a package one puts something like this into a Makefile:
$ pkgsend generate SOURCES/LWP/5.805 >BUILD/ms-perl-LWP.files
$ pkgmogrify -D PKG_VERSION=5 -D PKG_RELEASE=805 SPECS/ms-perl-LWP.p5m BUILD/ms-perl-LWP.files | \
    pkgsend publish -s /path/to/IPS/repo
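
The -D options above define the values that replace $(PKG_VERSION) and $(PKG_RELEASE) in the manifest. As a rough illustration of that substitution (this is my own sketch, not pkgmogrify's code, and leaving unknown macros untouched is a choice made here rather than documented pkgmogrify behaviour):

```python
import re


def expand_macros(line, defines):
    """Replace each $(NAME) with defines[NAME]; leave unknown macros as-is."""
    return re.sub(r"\$\((\w+)\)",
                  lambda m: defines.get(m.group(1), m.group(0)),
                  line)


defines = {"PKG_VERSION": "1.1.6", "PKG_RELEASE": "1"}
line = "depend type=require fmri=ms/pam/libpam@$(PKG_VERSION).$(PKG_RELEASE)"
expanded = expand_macros(line, defines)
print(expanded)
```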

Friday, January 16, 2015

ZFS: Persistent L2ARC

Recently Oracle integrated persistent L2ARC in ZFS and this is currently available in the ZFS-SA. It is not yet in Solaris 11 but it should be coming soon. Finally!

To make the very good news even better - it stores blocks in their raw format, so if, for example, you have compression enabled in your pool, then L2ARC will store blocks compressed as well (similarly for encryption). If your data compresses well, your L2ARC suddenly becomes much bigger as well.
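
As a back-of-the-envelope illustration of that last point, with made-up numbers (a 400 GB cache device and a 2.5x compression ratio) and ignoring any header or metadata overhead:

```python
def effective_cache_gb(device_gb, compress_ratio):
    """Logical data (GB) that fits when blocks are cached in compressed form."""
    return device_gb * compress_ratio


# Hypothetical numbers: a 400 GB cache device and a 2.5x compression ratio.
logical_gb = effective_cache_gb(400, 2.5)
print("about %d GB of logical data" % logical_gb)
```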

Friday, January 09, 2015

Docker on SmartOS

Bryan blogged about running Linux Docker containers on SmartOS. Really cool. Now I would love to see something similar in Solaris 11...