Friday, July 01, 2016

SPARC S7

Oracle has released the new SPARC S7 CPU and the SPARC S7-2 and S7-2L servers. This is a really interesting SPARC CPU if you need low-end servers, the first one in many, many years that can compete with x86 on both performance and price. It has some unique features as well.

See launch video.

Various articles on S7:

TheNextPlatform
The Register
PCWorld
ComputerWorld

Also see some benchmarks already published:

SPECjbb2015
SPECjEnterprise2010
Database: S7 vs x86
Yahoo Cloud Serving Benchmark


Friday, May 20, 2016

Adjusting SO_RCVBUF of a running process

Recently I was looking into how to increase the SO_RCVBUF size of a given socket in a running process, without having to restart it. This can be useful if an application can't be restarted anytime soon, yet drops are observed because the receive buffer is too small, or if an application doesn't allow the receive buffer to be specified at all and uses a hard-coded value. In my case, the application does allow the buffer to be specified, but it only sets it on startup and I couldn't restart it.
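
As an aside, the system-wide defaults can be tuned with ipadm on Solaris 11, but they only apply to sockets created afterwards, which is exactly why that is not enough here. For example, for TCP (the value below is just an example):

# ipadm show-prop -p recv_buf tcp
# ipadm set-prop -p recv_buf=400000 tcp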

Neither Solaris nor, AFAIK, Linux provides a tool to easily adjust the buffer of a socket in a running process, so I looked into whether I could do it via libproc. The answer is yes, and it is pretty straightforward.

I quickly wrote a small C program which changes the SO_RCVBUF size for a given pid and file descriptor. Let's see an example of how to use it.

There is a process with pid 893 listening on TCP port 32623, with SO_RCVBUF currently set to 128104:

# pfiles 893
893:    /usr/local/bin/test-daemon
  Current rlimit: 256 file descriptors
...
   4: S_IFSOCK mode:0666 dev:574,0 ino:43685 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
      SOCK_STREAM
      SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(128104)
      sockname: AF_INET 0.0.0.0  port: 32623
      congestion control: newreno
...

Let's change SO_RCVBUF to a higher value:

# ./pr_setsockopt 893 4 500000
Current SO_RCVBUF is 128104
New SO_RCVBUF is 500088

# pfiles 893
...
   4: S_IFSOCK mode:0666 dev:574,0 ino:43685 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
      SOCK_STREAM
      SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(500088)
      sockname: AF_INET 0.0.0.0  port: 32623
      congestion control: newreno
...

The code is very similar to the one I wrote last time. However, as there is no pr_setsockopt() wrapper in libproc, I wrote one based on how the other pr_* functions are implemented, specifically pr_getsockopt(). The trick is the Psyscall() function, which lets you execute any syscall in the context of the target process, so all that is required is to use it to invoke SYS_setsockopt.

As the Solaris source code is no longer publicly available, I used the Illumos source code as a reference. The program was tested only on Solaris 11 x86, although it probably works fine on Solaris 10 and Illumos, and should work on SPARC as well.

It is a quick "hack", with no safeguards, no proper argument parsing, etc.
Use it at your own risk.

// gcc -m64 -o pr_setsockopt pr_setsockopt.c -lproc

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <libproc.h>


static int
pr_setsockopt(struct ps_prochandle *Pr, int sock, int level, int optname,
    void *optval, int optlen) {
  sysret_t rval;    /* return value from setsockopt() */
  argdes_t argd[5]; /* arg descriptors for setsockopt() */
  argdes_t *adp;
  int error;

  if (Pr == NULL)  /* no subject process, operate on ourselves */
    return (setsockopt(sock, level, optname, optval, optlen));

  adp = &argd[0];  /* sock argument */
  adp->arg_value = sock;
  adp->arg_object = NULL;
  adp->arg_type = AT_BYVAL;
  adp->arg_inout = AI_INPUT;
  adp->arg_size = 0;

  adp++;   /* level argument */
  adp->arg_value = level;
  adp->arg_object = NULL;
  adp->arg_type = AT_BYVAL;
  adp->arg_inout = AI_INPUT;
  adp->arg_size = 0;

  adp++;   /* optname argument */
  adp->arg_value = optname;
  adp->arg_object = NULL;
  adp->arg_type = AT_BYVAL;
  adp->arg_inout = AI_INPUT;
  adp->arg_size = 0;

  adp++;   /* optval argument */
  adp->arg_value = 0;
  adp->arg_object = optval;
  adp->arg_type = AT_BYREF;
  adp->arg_inout = AI_INPUT;
  adp->arg_size = optlen; /* size of the buffer copied into the target */

  adp++;   /* optlen argument */
  adp->arg_value = optlen;
  adp->arg_object = NULL;
  adp->arg_type = AT_BYVAL;
  adp->arg_inout = AI_INPUT;
  adp->arg_size = 0;

  error = Psyscall(Pr, &rval, SYS_setsockopt, 5, &argd[0]);

  if (error) {
    errno = (error > 0)? error : ENOSYS;
    return (-1);
  }
  return (0);
}


int main(int argc, char **argv) {
  pid_t pid;
  int fd;
  int perr;
  static struct ps_prochandle *Pr;

  pid = atoi(argv[1]);
  fd = atoi(argv[2]);

  if((Pr = Pgrab(pid, PGRAB_NOSTOP, &perr)) == NULL) {
    printf("Pgrab() failed: %s\n", Pgrab_error(perr));
    exit(1);
  }

  int rcvbuf = 0;
  int rcvbuf_size = sizeof(rcvbuf);
  if(pr_getsockopt(Pr, fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &rcvbuf_size)) {
    perror("pr_getsockopt() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  printf("Current SO_RCVBUF is %d\n", rcvbuf);

  rcvbuf = atoi(argv[3]);
  if(pr_setsockopt(Pr, fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, rcvbuf_size)) {
    perror("pr_setsockopt() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  if(pr_getsockopt(Pr, fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &rcvbuf_size)) {
    perror("pr_getsockopt() failed");
    Prelease(Pr, 0);
    exit(1);
  }

  printf("New SO_RCVBUF is %d\n", rcvbuf);

  Prelease(Pr, 0);
  Pr = NULL;

  exit(0);  
}

Wednesday, March 02, 2016

Full command line returned by ps

The ps command can now show the full command line on Solaris 11 as well. Thank you, Casper.
For more details see here.
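
For comparison, the long-standing workaround was the pargs utility, which prints the full argument vector from /proc; now ps itself shows the complete arguments (the pid below is hypothetical):

# pargs 1234
# ps -o args -p 1234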

Sunday, August 30, 2015

Remote Management of ZFS servers with Puppet and RAD

Manuel Zach blogs about how to use Puppet and the new Solaris RAD's REST interface introduced in Solaris 11.3. Solaris RAD is really interesting if you want to manage your servers via a programmatic interface.

Friday, August 07, 2015

Kernel Zones - Adding Local Disks

It is possible to add a disk drive to a kernel zone by specifying its physical location instead of its CTD name:

add device
set storage=dev:chassis/SYS/HDD23/disk
set id=1
end
This is very handy on servers like the X5-2L with a pass-through controller, where all 26 local disks are visible this way.
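
For reference, a complete zonecfg session for a hypothetical kernel zone kz1 would look roughly like this:

# zonecfg -z kz1
zonecfg:kz1> add device
zonecfg:kz1:device> set storage=dev:chassis/SYS/HDD23/disk
zonecfg:kz1:device> set id=1
zonecfg:kz1:device> end
zonecfg:kz1> commit
zonecfg:kz1> exit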

Friday, July 10, 2015

Solaris 11.3 Beta

Solaris 11.3 Beta is available now. There are many interesting new features and lots of improvements.
Some of the features have already been available if you had access to the Solaris Support repository, but if not, you can now play with ZFS persistent L2ARC (which can also hold compressed blocks), ZFS lz4 compression, or perhaps you fancy the new (to Solaris) OpenBSD Packet Filter, or... see What's New for more details on all the new features.

Also see a collection of blog posts with more technical details about the new features, and a new batch of blogs about the update.

Monday, March 23, 2015

Physical Locations of PCI SSDs

The latest update to Solaris 11 (SRU 11.2.8.4.0) has a new feature: it can identify the physical locations of F40 and F80 PCIe SSD cards, which it registers under the Topology Framework.

Here is example diskinfo output on an X4-2L server with 24 SSDs in the front presented as a JBOD, two SSDs in the rear mirrored with the RAID controller (for the OS), and four PCIe F80 cards (each card presents four LUNs):

$ diskinfo
D:devchassis-path                        c:occupant-compdev
---------------------------------------  ---------------------
/dev/chassis/SYS/HDD00/disk              c0t55CD2E404B64A3E9d0
/dev/chassis/SYS/HDD01/disk              c0t55CD2E404B64B1ABd0
/dev/chassis/SYS/HDD02/disk              c0t55CD2E404B64B1BDd0
/dev/chassis/SYS/HDD03/disk              c0t55CD2E404B649E02d0
/dev/chassis/SYS/HDD04/disk              c0t55CD2E404B64A33Ed0
/dev/chassis/SYS/HDD05/disk              c0t55CD2E404B649DB5d0
/dev/chassis/SYS/HDD06/disk              c0t55CD2E404B649DBCd0
/dev/chassis/SYS/HDD07/disk              c0t55CD2E404B64AB2Fd0
/dev/chassis/SYS/HDD08/disk              c0t55CD2E404B64AC96d0
/dev/chassis/SYS/HDD09/disk              c0t55CD2E404B64A580d0
/dev/chassis/SYS/HDD10/disk              c0t55CD2E404B64ACC5d0
/dev/chassis/SYS/HDD11/disk              c0t55CD2E404B64B1DAd0
/dev/chassis/SYS/HDD12/disk              c0t55CD2E404B64ACF1d0
/dev/chassis/SYS/HDD13/disk              c0t55CD2E404B649EE1d0
/dev/chassis/SYS/HDD14/disk              c0t55CD2E404B64A581d0
/dev/chassis/SYS/HDD15/disk              c0t55CD2E404B64AB9Cd0
/dev/chassis/SYS/HDD16/disk              c0t55CD2E404B649DCAd0
/dev/chassis/SYS/HDD17/disk              c0t55CD2E404B6499CBd0
/dev/chassis/SYS/HDD18/disk              c0t55CD2E404B64AC98d0
/dev/chassis/SYS/HDD19/disk              c0t55CD2E404B6499B7d0
/dev/chassis/SYS/HDD20/disk              c0t55CD2E404B64AB05d0
/dev/chassis/SYS/HDD21/disk              c0t55CD2E404B64A33Fd0
/dev/chassis/SYS/HDD22/disk              c0t55CD2E404B64AB1Cd0
/dev/chassis/SYS/HDD23/disk              c0t55CD2E404B64A3CFd0
/dev/chassis/SYS/HDD24                   -
/dev/chassis/SYS/HDD25                   -
/dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk  c0t5002361000260451d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk  c0t5002361000258611d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN2/disk  c0t5002361000259912d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk  c0t5002361000259352d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN0/disk  c0t5002361000262937d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN1/disk  c0t5002361000262571d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN2/disk  c0t5002361000262564d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN3/disk  c0t5002361000262071d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk  c0t5002361000125858d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk  c0t5002361000125874d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN2/disk  c0t5002361000194066d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN3/disk  c0t5002361000142889d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN0/disk  c0t5002361000371137d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN1/disk  c0t5002361000371435d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN2/disk  c0t5002361000371821d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN3/disk  c0t5002361000371721d0

Let's create a ZFS pool on top of the F80s and see the zpool status output (you can use the SYS/MB/... names when creating a pool as well).
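
For example, the pool could have been created directly with the chassis paths, along these lines (hypothetical pool name tank; remaining mirror pairs omitted):

# zpool create tank mirror /dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk /dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk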

$ zpool status -l XXXXXXXXXXXXXXXXXXXX-1
  pool: XXXXXXXXXXXXXXXXXXXX-1
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Mar 21 11:31:01 2015
config:

        NAME                                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXX-1                       ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk  ONLINE       0     0     0
          mirror-1                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk  ONLINE       0     0     0
          mirror-2                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN3/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN2/disk  ONLINE       0     0     0
          mirror-3                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE4/F80/LUN2/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk  ONLINE       0     0     0
          mirror-4                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN3/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN0/disk  ONLINE       0     0     0
          mirror-5                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN2/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN1/disk  ONLINE       0     0     0
          mirror-6                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN1/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN3/disk  ONLINE       0     0     0
          mirror-7                                   ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE2/F80/LUN0/disk  ONLINE       0     0     0
            /dev/chassis/SYS/MB/PCIE5/F80/LUN2/disk  ONLINE       0     0     0

errors: No known data errors

It also means that all FMA alerts should include the physical path as well, which should make identifying a given F80/LUN much easier if something goes wrong.

Saturday, March 21, 2015

ZFS: Persistent L2ARC

Solaris SRU 11.2.8.4.0 delivers persistent L2ARC. What is interesting about it is that it stores raw ZFS blocks, so if you have compression enabled, L2ARC will store the compressed blocks as well (and can therefore cache more data). The same applies to encryption.
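
For example, with a hypothetical pool tank and a spare SSD c1t2d0, the setup is just the usual compression property plus a cache device; blocks cached on c1t2d0 are then kept in their compressed on-disk form:

# zfs set compression=on tank
# zpool add tank cache c1t2d0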

Friday, March 20, 2015

Managing Solaris with RAD

Solaris 11 provides RAD: "The Remote Administration Daemon, commonly referred to by its acronym and command name, rad, is a standard system service that offers secure, remote administrative access to an Oracle Solaris system."

RAD is essentially an API to programmatically manage and query different Solaris subsystems like networking, zones, kstat, smf, etc.

Let's see an example on how to use it to list all zones configured on a local system.

# cat zone_list.py
#!/usr/bin/python

import rad.client as radcli
import rad.connect as radcon
import rad.bindings.com.oracle.solaris.radm.zonemgr_1 as zbind

with radcon.connect_unix() as rc:
    zones = rc.list_objects(zbind.Zone())
    for zname in zones:
        zone = rc.get_object(zname)
        print "zone: %s (%s)" % (zone.name, zone.state)
        for prop in zone.getResourceProperties(zbind.Resource('global')):
            if prop.name == 'zonename':
                continue
            print "\t%-20s : %s" % (prop.name, prop.value)

# ./zone_list.py
zone: kz1 (configured)
        zonepath            :
        brand               : solaris-kz
        autoboot            : false
        autoshutdown        : shutdown
        bootargs            :
        file-mac-profile    :
        pool                :
        scheduling-class    :
        ip-type             : exclusive
        hostid              : 0x44497532
        tenant              :
zone: kz2 (installed)
        zonepath            : /system/zones/%{zonename}
        brand               : solaris-kz
        autoboot            : false
        autoshutdown        : shutdown
        bootargs            :
        file-mac-profile    :
        pool                :
        scheduling-class    :
        ip-type             : exclusive
        hostid              : 0x41d45bb
        tenant              :

Here is another example, showing how to create a new kernel zone with the autoboot property set to true:

#!/usr/bin/python

import sys

import rad.client
import rad.connect
import rad.bindings.com.oracle.solaris.radm.zonemgr_1 as zonemgr

class SolarisZoneManager:
    def __init__(self):
        self.description = "Solaris Zone Manager"

    def init_rad(self):
        try:
            self.rad_instance = rad.connect.connect_unix()
        except Exception as reason:
            print "Cannot connect to RAD: %s" % (reason)
            exit(1)

    def get_zone_by_name(self, name):
        try:
            pat = rad.client.ADRGlobPattern({'name': name})
            zone = self.rad_instance.get_object(zonemgr.Zone(), pat)
        except rad.client.NotFoundError:
            return None
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return None

        return zone

    def zone_get_resource_prop(self, zone, resource, prop, filter=None):
        try:
            val = zone.getResourceProperties(zonemgr.Resource(resource, filter), [prop])
        except rad.client.ObjectError:
            return None
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return None

        return val[0].value if val else None

    def zone_set_resource_prop(self, zone, resource, prop, val):
        current_val = self.zone_get_resource_prop(zone, resource, prop)
        if current_val is not None and current_val == val:
            # the val is already set
            return 0

        try:
            if current_val is None:
                zone.addResource(zonemgr.Resource(resource, [zonemgr.Property(prop, val)]))
            else:
                zone.setResourceProperties(zonemgr.Resource(resource), [zonemgr.Property(prop, val)])
        except rad.client.ObjectError as err:
            print "Failed to set %s property on %s resource for zone %s: %s" % (prop, resource, zone.name, err)
            return 0

        return 1

    def zone_create(self, name, template):
        zonemanager = self.rad_instance.get_object(zonemgr.ZoneManager())
        zonemanager.create(name, None, template)
        zone = self.get_zone_by_name(name)
        
        try:
            zone.editConfig()
            self.zone_set_resource_prop(zone, 'global', 'autoboot', 'true')
            zone.commitConfig()
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return 0
 
        return 1

x = SolarisZoneManager()
x.init_rad()
if x.zone_create(str(sys.argv[1]), 'SYSsolaris-kz'):
    print "Zone created succesfully." 

There are many simple examples in the zonemgr(3RAD) man page, and I found it very useful to look at solariszones/driver.py from OpenStack. It is actually very interesting that OpenStack is using RAD on Solaris.

RAD is very powerful, and with more modules constantly being added it is becoming a comprehensive programmatic API to remotely manage Solaris systems. It is also very useful if you are writing components of a configuration management system for Solaris.

What's the most anticipated RAD module currently missing in stable Solaris? I think it is a ZFS module...