Friday, December 03, 2010

Religion in IT

Joerg posted:
Interesting statement in a searchdatacenter article about the IDC numbers
“When you sell against Dell, you sell against price. When you sell against HP, you sell against technical stuff -- the feeds and speeds. When you're up against IBM, you're not selling against boxes but against solutions or business outcomes that happen to include hardware. But, when you get to the Sun guys, it's about religion. You can't get to those guys. One guy told me last year that he would get off his Sun box when he dies."

Thursday, December 02, 2010

Linux, O_SYNC and Write Barriers

We all love Linux... sometimes it is better not to look under its hood though as you never know what you might find.

I stumbled across a very interesting discussion on a Linux kernel mailing list. It is dated August 2009 so you may have already read it.

There is a related RH bug.

I'm a little bit surprised by RH's attitude in this ticket. IMHO they should have fixed it, and maybe provided a tunable to enable/disable the new behavior, instead of keeping the broken implementation. But at least in recent man pages they have clarified it in the Notes section of open(2):
"POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."

Then there is another even more interesting discussion about write barriers:
"All of them fail to commit drive caches under some circumstances;
even fsync on ext3 with barriers enabled (because it doesn't
commit a journal record if there were writes but no inode change
with data=ordered)."
and also this one:
"No, fsync() doesn't always flush the drive's write cache. It often
does, any I think many people are under the impression it always does, but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);

while (1) {
    char byte;
    usleep (100000);
    pwrite (fd, &byte, 1, 0);
    fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more. That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals. A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance. I'm not sure if ordered requests are actually implemented
by any drivers at the moment. If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it sucks performance for database-like applications and virtual machines, a lot due to unnecessary seeks. That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need."
This is really scary. I wonder how many developers knew about this, especially those writing Linux code where data safety was paramount. Sometimes it feels that some Linux developers are coding to win benchmarks and do not necessarily care about data safety, correctness, or standards like POSIX. What is even worse is that some of them don't even bother to tell you about it in the official documentation (at least the O_SYNC/O_DSYNC issue is documented in the man page now).
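If you want to reproduce the quoted experiment, below is a self-contained version of the test loop from the mailing list post (the surrounding boilerplate is mine); run it on ext3 and watch the write and cache-flush operations at the block layer, for example with iostat:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd == -1) {
        perror("open");
        exit(1);
    }
    /* ~10 small synchronous writes per second to the same offset;
       per the discussion above, on ext3 most of them will not result
       in a drive cache flush because the inode rarely changes. */
    while (1) {
        char byte = 0;
        usleep(100000);
        pwrite(fd, &byte, 1, 0);
        fsync(fd);
    }
    /* not reached */
    return 0;
}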


Monday, November 15, 2010

Solaris 11 Express

It is based on build 151. The interesting thing is that you can buy standard support for it,
and it doesn't look like Oracle is treating it as a beta - at least not officially.
But then why not call it Solaris 11? I guess it is partly due to marketing reasons and partly because it is not entirely ready yet - there are some components which require further work.

Monday, October 04, 2010

ZFS Encryption

It looks like the ZFS crypto project has finally been integrated. It's been reported by another blogger here, and the bug id 4854202 "ZFS data set encryption" has been updated to reflect that it is in snv_149. Congratulations to Darren Moffat - the guy behind the project!

It took much longer than expected. If I were to speculate, I would say the integration was delayed on purpose so that it landed only after public access to the onnv gate was closed, so 3rd parties like Nexenta could not take advantage of it before Oracle does. Hopefully once Solaris 11 is out we will see the source code as well. It also probably means that the feature will be in Solaris 11 Express.

Wednesday, September 29, 2010

Nexenta's take on ZFS, Oracle and NetApp

Interesting post from Nexenta on recent developments in regards to ZFS, Oracle and NetApp. There is also a podcast which is a Q&A session with Evan Powell, CEO of Nexenta.

Friday, September 24, 2010

Oracle to buy NetApp?

The Register:
"Oracle boss Larry Ellison said he'd love to have the 60 per cent of NetApp's business that plugs NetApp boxes into Oracle software, hinting that a NetApp purchase could be on his mind.

[...]

A combination of Oracle and NetApp would be a truly formidable server/storage play, even if the IBM NetApp reseller deal subsequently collapsed. Will Oracle actually make an approach? Will NetApp respond positively? The key there is whether it truly believes an independent, best-of-breed stance is viable or whether the stack-in-a-box, centralised IT approach is a sustained tank attack that threatens to flatten its business.

A future inside Oracle in those circumstances might look a lot better than a slow fade into a Unisys-like state while EMC, buoyed up with VMware and products above raw storage arrays, grows and prospers."

Bloomberg also reports on Oracle acquisitions plans.

New SPARC T3 Systems

Joerg reports:
Oracle announced several new SPARC systems based on the SPARC T3 processor. SPARC T3 is the next iteration of SPARC in the throughput arena: 16 cores running at 1.65 GHz, so 128 threads per socket, on-chip PCIe 2.0, on-chip 10GbE ...

It starts with one 16-core 1.65GHz SPARC T3 processor in the SPARC T3-1 Server, so you get 128 threads packaged in a 2 rack unit chassis.

Above this you will find the SPARC T3-2 with two SPARC T3 processors. It gives you 256 threads and 256 GB. You get this amount of power in 3 RU.

At the high end is the T3-4 with four SPARC T3 CPUs, giving you 512 threads, 512 GB of main memory with 8 GB DIMMs, and 16 PCI Express Modules. This system is somewhat larger: 5 RU.

Lastly, there is a new blade for the Blade 6000 chassis: the SPARC T3-1B. As the name suggests, it provides one SPARC T3 processor, so 128 threads on one blade.
The Register reports on the new systems as well.

If you are interested in a brief summary of what T3 is, read Joerg's post about it here.

Tuesday, September 14, 2010

OpenIndiana

See the announcement and slides. You can also download ISOs here which are based on snv_147.

So it has begun... Is it going to survive in the long term?

Friday, September 10, 2010

OpenIndiana

http://openindiana.org

OpenIndiana is a continuation of the OpenSolaris operating system. It was conceived during the period of uncertainty following the Oracle takeover of Sun Microsystems, after several months passed with no binary updates made available to the public. The formation proved timely, as Oracle discontinued OpenSolaris soon after in favour of Solaris 11 Express, a binary distribution with a more closed development model to debut later this year.

OpenIndiana is part of the Illumos Foundation, and provides a true open source community alternative to Solaris 11 and Solaris 11 Express, with an open development model and full community participation.

Announcement Details

We will be holding a press conference in London (UK) at the JISC offices at 6:30pm UK Time (BST, GMT+1) to formally announce the project and provide full details, including information on gaining access to download our first development release. In addition, we will be broadcasting live on the internet – please see http://openindiana.org/announcement for details of attending in person or on the web.

We believe this announcement will deliver the distribution the community has long sought after.

Solaris 10 9/10

What's New:
  • Installation Enhancements
    • SPARC: Support for ITU Construction Tools on SPARC Platforms - In this release, the itu utility has been modified to support booting a SPARC based system with the install-time updates (ITU) process.
    • Oracle Solaris Auto Registration - before you ask ... you can disable it
    • Oracle Solaris Upgrade Enhancement for Oracle Solaris Zone Cluster Nodes
  • Virtualization Enhancements for Oracle Solaris Zones
    • Migrating a Physical Oracle Solaris 10 System Into a Zone - cool, my RfE found its way into Solaris 10
    • Host ID Emulation
    • Updating Packages by Using the New zoneadm attach -U Option
  • Virtualization Enhancements for Oracle VM Server for SPARC (formerly known as LDOMs)
    • Memory Dynamic Reconfiguration Capability
    • Virtual Disk Multipathing Enhancements
    • Static Direct I/O
    • Virtual Domain Information Command and API - the virtinfo command
  • System Administration Enhancements
    • Oracle Solaris ZFS Features and Enhancements
    • Fast Crash Dump
    • x86: Support for the IA32_ENERGY_PERF_BIAS MSR
    • Support for Multiple Disk Sector Size
    • iSCSI Initiator Tunables
    • Sparse File Support in the cpio Command
    • x86: 64-Bit libc String Functions Improvements With SSE
    • Automated Rebuilding of sendmail Configuration Files
    • Automatic Boot Archive Recovery
  • Security Enhancements
    • net_access Privilege
    • x86: Intel AES-NI Optimization
  • Language Support Enhancements
    • New Oracle Solaris Unicode Locales
  • Device Management Enhancements
    • iSCSI Boot
    • iSER Initiator
    • New Hot-Plugging Features
    • AAC RAID Power Management
  • Driver Enhancements
    • x86: HP Smart Array HBA Driver
    • x86: Support for Broadcom NetXtreme II 10 Gigabit Ethernet NIC Driver
    • x86: New SATA HBA Driver, bcm_sata, for Broadcom HT1000 SATA Controllers
    • Support for SATA/AHCI Port Multiplier
    • Support for Netlogic NLP2020 PHY in the nxge Driver
  • Freeware Enhancements
    • GNU TAR Version 1.23
    • Firefox 3.5
    • Thunderbird 3
    • Less Version 436
  • Networking Enhancements
    • BIND 9.6.1 for the Oracle Solaris 10 OS
    • GLDv3 Driver APIs
    • IPoIB Connected Mode
    • Open Fabrics User Verbs Primary Kernel Components
    • InfiniBand Infrastructure Enhancements
  • X11 Windowing Enhancements
    • Support for the setxkbmap Command
  • New Chipset Support
    • ixgbe Driver to Integrate Intel Shared Code Version 3.1.9
    • Broadcom Support to bge Networking Driver
    • x86: Fully Buffered DIMM Idle Power Enhancement
  • Fault Management Architecture Enhancements
    • FMA Support for AMD's Istanbul Based Systems
    • Several Oracle Solaris FMA Enhancements
  • Diagnostic Tools Enhancements
    • Sun Validation Test Suite 7.0ps9
    • Enhancements to the mdb Command to Improve the Debugging Capability of kmem and libumem
You will find a more in-depth description at docs.sun.com.

NetApp, Oracle Drop ZFS Lawsuits

The Wall Street Journal:

"NetApp Inc. and Oracle Corp. have agreed to dismiss patent lawsuits against each other, putting to bed a battle that had been ongoing since 2007.

The terms of the agreement weren't disclosed, and the companies said they are seeking to have the lawsuits dismissed without prejudice.

NetApp President and Chief Executive Tom Georgens said Thursday the companies would continue to collaborate in the future.

NetApp, which sells data-storage systems, and Sun Microsystems, which has since been bought by Oracle, began the fight in 2007, with each company alleging the other was infringing some of its patents.

Lawsuits were filed on both companies' behalf, with NetApp alleging that Sun's ZFS file-system-management technology infringed a number of its patents. Sun at the time also said NetApp products infringed its patents."

See also The Register article.

Saturday, August 28, 2010

The Future of Solaris

Below is a collection of blog entries on what's going on recently in regards to Open Solaris and Solaris 11.

Saturday, July 31, 2010

SMF/FMA Update

A rather large and interesting putback for SMF/FMA related technologies went into Open Solaris yesterday. It will be available in build 146.
PSARC/2009/617 Software Events Notification Parameters CLI
PSARC/2009/618 snmp-notify: SNMP Notification Daemon for Software Events
PSARC/2009/619 smtp-notify: Email Notification Daemon for Software Events
PSARC/2010/225 fmd for non-global Solaris zones
PSARC/2010/226 Solaris Instance UUID
PSARC/2010/227 nvlist_nvflag(3NVPAIR)
PSARC/2010/228 libfmevent additions
PSARC/2010/257 sysevent_evc_setpropnvl and sysevent_evc_getpropnvl
PSARC/2010/265 FMRI and FMA Event Stabilty, 'ireport' category 1 event class, and the 'sw' FMRI scheme
PSARC/2010/278 FMA/SMF integration: instance state transitions
PSARC/2010/279 Modelling panics within FMA
PSARC/2010/290 logadm.conf upgrade
6392476 fmdump needs to pretty-print
6393375 userland ereport/ireport event generation interfaces
6445732 Add email notification agent for FMA and software events
6804168 RFE: Allow an efficient means to monitor SMF services status changes
6866661 scf_values_destroy(3SCF) will segfault if is passed NULL
6884709 Add snmp notification agent for FMA and software events
6884712 Add private interface to tap into libfmd_msg macro expansion capabilities
6897919 fmd to run in a non-global zone
6897937 fmd use of non-private doors is not safe
6900081 add a UUID to Solaris kernel image for use in crashdump identification
6914884 model panic events as a defect diagnosis in FMA
6944862 fmd_case_open_uuid, fmd_case_uuisresolved, fmd_nvl_create_defect
6944866 log legacy sysevents in fmd
6944867 enumerate svc scheme in topo
6944868 software-diagnosis and software-response fmd modules
6944870 model SMF maintenance state as a defect diagnosis in FMA
6944876 savecore runs in foreground for systems with zfs root and dedicated dump
6965796 Implement notification parameters for SMF state transitions and FMA events
6968287 SUN-FM-MIB.mib needs to be updated to reflect Oracle information
6972331 logadm.conf upgrade PSARC/2010/290

Thursday, July 29, 2010

Dell and HP Continue Supporting Solaris

New announcement about Solaris support on non-Oracle servers:
  • Oracle today announced Dell and HP will certify and resell Oracle Solaris, Oracle Enterprise Linux and Oracle VM on their respective x86 platforms.
  • Customers will have full access to Oracle’s Premier Support for Oracle Solaris, Oracle Enterprise Linux and Oracle VM running on Dell and HP servers. This will enable fast and accurate issue resolution and reduced risk in a company’s operating environment.
  • Customers who subscribe to Oracle Premier Support will benefit from Oracle’s continuing investment in Oracle Solaris, Oracle Enterprise Linux and Oracle VM and the resulting innovation in future updates.

Wednesday, June 16, 2010

zpool scrub - 4GB/s

Open Solaris 2009.06 + entry level 2U x86 server Sun Fire x4270 + 1U xxxx array.

# iostat -xnzCM 1|egrep "device|c[0123]$"
[...]
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
8182.1 0.0 1022.1 0.0 0.1 152.8 0.0 18.7 0 1077 c0
8179.1 0.0 1021.7 0.0 0.1 148.7 0.0 18.2 0 1076 c1
8211.0 0.0 1025.9 0.0 0.1 162.8 0.0 19.8 0 1081 c2
8218.0 0.0 1026.8 0.0 0.1 164.5 0.0 20.0 0 1085 c3
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
8080.2 0.0 1010.0 0.0 0.1 168.3 0.0 20.8 0 1070 c0
8080.2 0.0 1010.0 0.0 0.1 167.6 0.0 20.7 0 1071 c1
8165.2 0.0 1020.3 0.0 0.1 166.0 0.0 20.3 0 1079 c2
8157.2 0.0 1019.3 0.0 0.1 151.4 0.0 18.6 0 1080 c3
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
8192.0 0.0 1023.4 0.0 0.1 174.6 0.0 21.3 0 1085 c0
8190.9 0.0 1023.1 0.0 0.1 174.2 0.0 21.3 0 1085 c1
8140.9 0.0 1016.9 0.0 0.1 145.5 0.0 17.9 0 1078 c2
8138.9 0.0 1016.7 0.0 0.1 142.7 0.0 17.5 0 1075 c3
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
8129.1 0.0 1015.6 0.0 0.1 153.0 0.0 18.8 0 1066 c0
8125.2 0.0 1015.1 0.0 0.1 155.1 0.0 19.1 0 1067 c1
8156.2 0.0 1018.8 0.0 0.1 162.1 0.0 19.9 0 1074 c2
8159.2 0.0 1019.2 0.0 0.1 162.0 0.0 19.9 0 1076 c3
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
8177.9 0.0 1022.0 0.0 0.1 165.0 0.0 20.2 0 1088 c0
8184.9 0.0 1022.9 0.0 0.1 165.2 0.0 20.2 0 1085 c1
8209.9 0.0 1026.1 0.0 0.1 162.4 0.0 19.8 0 1085 c2
8204.9 0.0 1025.5 0.0 0.1 161.8 0.0 19.7 0 1087 c3
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
8236.4 0.0 1029.2 0.0 0.1 170.1 0.0 20.7 0 1092 c0
8235.4 0.0 1029.0 0.0 0.1 170.2 0.0 20.7 0 1093 c1
8215.4 0.0 1026.4 0.0 0.1 165.3 0.0 20.1 0 1091 c2
8220.4 0.0 1027.0 0.0 0.1 164.9 0.0 20.1 0 1090 c3


Then with small I/Os it can sustain over 400k IOPS - more HBAs should deliver even more performance.

It is really amazing how fast technology is progressing.
Achieving the above numbers 10 years ago would have cost a small fortune.

Saturday, June 12, 2010

Heat Maps

Brendan Gregg wrote an article in ACM Queue about Visualizing System Latency as heat maps. The article explains really well what latency heat maps are and how to read them. It is also a good read if you want to learn about a rainbow pterodactyl (shown below) flying over an icy lake inside a disk array.

Thursday, June 10, 2010

Open Solaris Roadmap

It is rather brief but at least it is something.
See also a nice commercial below.


Friday, May 28, 2010

DTrace TCP and UDP providers

Last night two new DTrace providers were integrated.
They should be available in build 142 of Open Solaris.

PSARC 2010/106 DTrace TCP and UDP providers
"This case adds DTrace 'tcp' and 'udp' providers with probes
for send and receive events. These providers cover the TCP
and UDP protocol implementations in OpenSolaris respectively. In
addition the tcp provider contains probes for TCP state machine
transitions and significant events in connection processing
(connection request, acceptance, refusal etc). The udp provider
also contains probes which fire when a UDP socket is opened/closed.
This is intended for use by customers for network observability and
troubleshooting, and this work represents the second and third
components of a suite of planned providers for the network stack. The
first was described in PSARC/2008/302 DTrace IP Provider."

The tcp provider is described here:

http://wikis.sun.com/display/DTrace/tcp+Provider

...and the udp provider is described here:

http://wikis.sun.com/display/DTrace/udp+Provider

Tuesday, May 04, 2010

ZFS - synchronous vs. asynchronous IO

Sometimes it is very useful to be able to disable the synchronous behavior of a filesystem. Unfortunately not all applications provide such functionality. With UFS many people used fastfs from time to time; the problem is that it can potentially lead to filesystem corruption. In the case of ZFS many people have been using the undocumented zil_disable tunable. While it can cause data corruption from an application point of view, it doesn't impact ZFS on-disk consistency. This makes the feature very useful - the risk is much smaller and it can greatly improve performance in some cases, like database imports, NFS servers, etc. The problem with the tunable is that it is unsupported, has a server-wide impact, and affects only newly mounted ZFS filesystems, while having an instant effect on zvols.

From time to time there were requests here and there to get it implemented properly, in a fully supported way. I thought it might be a good opportunity to refresh my understanding of Open Solaris and ZFS internals, so a couple of months ago I decided to implement it under 6280630 zil synchronicity.
And it was fun - I really enjoyed it. I spent most of the time trying to understand the interactions between the ZIL/VNODE/VFS layers and the structure of the ZFS code. I was already familiar with it to some extent, as I contributed code to ZFS in the past and I also read the code from time to time when doing performance tuning, etc. Once I understood what's going on there, the actual coding was really easy. Once I had basic functionality working I asked for a sponsor so it could be integrated. Tim Haley offered to sponsor me and help me get it integrated. A couple of months later, after a PSARC case, code reviews, email exchanges and testing, it finally got integrated and should appear in build 140.

I would like to thank Tim Haley, Mark Musante and Neil Perrin for all their comments, code reviews, testing, PSARC case handling, etc. It was a real pleasure to work with you.


PSARC/2010/108 zil synchronicity

ZFS datasets now have a new 'sync' property to control synchronous behavior.
The zil_disable tunable to turn synchronous requests into asynchronous requests (disable the ZIL) has been removed. For systems that use that switch, on upgrade you will now see a message on boot:

sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property.
Here is a summary of the property:

-------

The options and semantics for the zfs sync property:

sync=standard
This is the default option. Synchronous file system transactions
(fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
and then secondly all devices written are flushed to ensure
the data is stable (not cached by device controllers).

sync=always
For the ultra-cautious, every file system transaction is
written and flushed to stable storage by a system call return.
This obviously has a big performance penalty.

sync=disabled
Synchronous requests are disabled. File system transactions
only commit to stable storage on the next DMU transaction group
commit which can be many seconds. This option gives the
highest performance. However, it is very dangerous as ZFS
is ignoring the synchronous transaction demands of
applications such as databases or NFS.
Setting sync=disabled on the currently active root or /var
file system may result in out-of-spec behavior, application data
loss and increased vulnerability to replay attacks.
This option does *NOT* affect ZFS on-disk consistency.
Administrators should only use this when these risks are understood.

The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command. For example:

# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin
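
For context, the synchronous requests the property governs are the ones applications issue themselves - O_SYNC/O_DSYNC writes, fsync(), NFS commits, etc. Below is a minimal sketch (mine; the path assumes the whirlpool/milek dataset above is mounted at its default mountpoint) of the database-style commit pattern whose behaviour the sync property changes:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Append a record and ask for it to be stable before returning.
       With sync=standard this goes through the ZIL and a cache flush;
       with sync=disabled the fsync() returns immediately and the data
       is only guaranteed at the next transaction group commit. */
    int fd = open("/whirlpool/milek/commit.log",
        O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) {
        perror("open");
        exit(1);
    }
    if (write(fd, "COMMIT 42\n", 10) != 10)
        perror("write");
    if (fsync(fd) == -1)
        perror("fsync");
    close(fd);
    return 0;
}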


Have fun!

Thursday, April 29, 2010

Gartner on Oracle/Sun New Support Model

"Oracle announces its new Sun support policy that has the potential to radically change the way in which OEMs offer support and may make third-party maintenance offerings for Sun hardware unprofitable."
Read more.

Saturday, April 24, 2010

Oracle Solaris on HP ProLiant Servers

Recently there has been lots of confusion regarding running Solaris 10 on non-Sun servers.

HP Oracle Solaris 10 Subscriptions and Support:
"Certifying Oracle Solaris on ProLiant servers since 1996, HP is expanding its relationship with Oracle to include selling Oracle Solaris 10 Operating System Subscriptions and support from HP Technology Services on certified ProLiant servers.

HP will provide the subscriptions and support for the Oracle Solaris 10 Operating System on certified ProLiant servers and Oracle will provide patches and updates directly to HP's customers through Oracle SunSolve.

As part of this expanded relationship HP and Oracle will work together to enhance the customer experience for Oracle Solaris on ProLiant servers and HP increase its participation in the OpenSolaris community."

And of course you can also run Open Solaris on any x86 hardware, including HP servers, entirely for free if you want. I wonder, though, if it would make sense for HP to also offer support for Open Solaris - more and more customers are deploying Open Solaris instead of Solaris 10 on their servers, and Oracle already offers support for it on its own servers.

Monday, March 29, 2010

ZFS diff


PSARC/2010/105:

There is a long-standing RFE for zfs to be able to describe
what has changed between the snapshots of a dataset.
To provide this capability, we propose a new 'zfs diff'
sub-command. When run with appropriate privilege the
sub-command describes what file system level changes have
occurred between the requested snapshots. A diff between the
current version of the file system and one of its snapshots is
also supported.

Five types of change are described:

o File/Directory modified
o File/Directory present in older snapshot but not newer
o File/Directory present in newer snapshot but not older
o File/Directory renamed
o File link count changed

      zfs diff snapshot  snapshot | filesystem

Gives a high level description of the differences between a
snapshot and a descendant dataset. The descendant may either
be a later snapshot of the dataset or the current dataset.
For each file system object that has undergone a change
between the original snapshot and the descendant, the type of
change is described along with the name of the file or
directory. In the case of a rename, both the old and new
names are shown.

The type of change is described with a single character:

+ Indicates the file/directory was added in the later dataset
- Indicates the file/directory was removed in the later dataset
M Indicates the file/directory was modified in the later dataset
R Indicates the file/directory was renamed in the later dataset

If the modification involved a change in the link count of a
file, the change will be expressed as a delta within
parentheses on the modification line. Example outputs are
below:

M /myfiles/
M /myfiles/link_to_me (+1)
R /myfiles/rename_me -> /myfiles/renamed
- /myfiles/delete_me
+ /myfiles/new_file

Saturday, March 27, 2010

Project Brussels Phase II

This project introduces a new CLI utility called ipadm(1M) that can be used to perform:

* IP interfaces management (creation/deletion)
* IP address management (add, delete, show) for static IPv4 & IPv6 addresses, DHCP, and stateless/stateful IPv6 address configuration
* protocol (IP/TCP/UDP/SCTP/ICMP) tunable management (set, get, reset) of global (ndd(1M)) tunables, as well as per-interface tunables
* provide persistence for all of the three features above so that on reboot the configuration is reapplied

Please see the case materials of PSARC 2010/080 for the latest design document and read ipadm(1M) man page for more information.


This has been integrated into Open Solaris.

Friday, March 26, 2010

CPU/MEM HotPlug on x86 in Open Solaris

The integration of:

PSARC/2009/104 Hot-Plug Support for ACPI-based Systems
PSARC/2009/550 PSMI extensions for CPU Hotplug
PSARC/2009/551 acpihpd ACPI Hotplug Daemon
PSARC/2009/591 Attachment Points for Hotpluggable x86 systems
6862510 provide support for cpu hot add on x86
6874842 provide support for memory hot add on x86
6883891 cmi interface needs to support dynamic reconfiguration
6884154 x2APIC and kmdb may not function properly during CPU hotplug event.
6904971 low priority acpi nexus code review feedback
6877301 lgrp should support memory hotplug flag in SRAT table

Introduces support for hot-adding CPUs and memory to a running Xeon 7500 platform.

Sunday, February 28, 2010

ReadyBoost

I didn't know that Windows has a similar technology to ZFS L2ARC which is called ReadyBoost. Nice.

I'm building my new home NAS server and I'm currently seriously considering putting the OS on a USB pen drive, leaving all SATA disks for data only. It looks like with modern USB drives the OS should actually boot faster than from a SATA disk thanks to much better seek times. I'm planning on doing some experiments first.

Thursday, February 25, 2010

ZVOLs - Write Cache

When you create a ZFS volume its write cache is disabled by default, meaning that all writes to the volume will be synchronous. Sometimes it might be handy, though, to be able to enable the write cache for a particular zvol. I wrote a small C program which allows you to check whether the write cache is enabled or not. It also allows you to enable or disable the write cache for a specified zvol.

First let's check if the write cache is disabled for the zvol rpool/iscsi/vol1

milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1
Write Cache: disabled

Now let's issue 1000 writes

milek@r600:~/progs# ptime ./sync_file_create_loop /dev/zvol/rdsk/rpool/iscsi/vol1 1000

real 12.013566363
user 0.003144874
sys 0.104826470

So it took 12s, and I also confirmed that the writes were actually being issued to the disk drive. Let's enable the write cache now and repeat the 1000 writes

milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1 1
milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1
Write Cache: enabled

milek@r600:~/progs# ptime ./sync_file_create_loop /dev/zvol/rdsk/rpool/iscsi/vol1 1000

real 0.239360231
user 0.000949655
sys 0.019019552

Worked fine.
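
The sync_file_create_loop program used above is not included in this post; here is a hypothetical sketch of such a test (my assumption: it simply issues the requested number of synchronous 1-byte writes to the given device), just to show what the timed workload looks like:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int i, n, fd;
    char byte = 0;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <count>\n", argv[0]);
        exit(1);
    }
    n = atoi(argv[2]);

    /* O_DSYNC: every write must be stable before pwrite() returns;
       with the zvol write cache disabled that means a media write. */
    if ((fd = open(argv[1], O_WRONLY | O_DSYNC | O_LARGEFILE)) == -1) {
        perror("open");
        exit(2);
    }
    for (i = 0; i < n; i++) {
        if (pwrite(fd, &byte, 1, 0) != 1) {
            perror("pwrite");
            break;
        }
    }
    close(fd);
    return 0;
}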

The zvol_wce program is not idiot-proof and it doesn't check whether the operation succeeded or not. You should be able to compile it by issuing: gcc -o zvol_wce zvol_wce.c

milek@r600:~/progs# cat zvol_wce.c

/* Robert Milkowski
http://milek.blogspot.com
*/

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <stropts.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/dkio.h>

int main(int argc, char **argv)
{
    char *path;
    int wce = 0;
    int rc;
    int fd;

    path = argv[1];

    if ((fd = open(path, O_RDONLY | O_LARGEFILE)) == -1)
        exit(2);

    if (argc > 2) {
        /* second argument given - set the write cache state
           (any non-zero value enables it, 0 disables it) */
        wce = atoi(argv[2]) ? 1 : 0;
        rc = ioctl(fd, DKIOCSETWCE, &wce);
    } else {
        /* no second argument - just report the current state */
        rc = ioctl(fd, DKIOCGETWCE, &wce);
        printf("Write Cache: %s\n", wce ? "enabled" : "disabled");
    }

    close(fd);
    exit(0);
}

Wednesday, February 10, 2010

Dell - No 3rd Party Disk Drives Allowed

Third-party drives not permitted:
"[...]
Is Dell preventing the use of 3rd-party HDDs now?
[....]
Howard_Shoobe at Dell.com:

Thank you very much for your comments and feedback regarding exclusive use of Dell drives. It is common practice in enterprise storage solutions to limit drive support to only those drives which have been qualified by the vendor. In the case of Dell's PERC RAID controllers, we began informing customers when a non-Dell drive was detected with the introduction of PERC5 RAID controllers in early 2006. With the introduction of the PERC H700/H800 controllers, we began enabling only the use of Dell qualified drives. There are a number of benefits for using Dell qualified drives in particular ensuring a positive experience and protecting our data. While SAS and SATA are industry standards there are differences which occur in implementation. An analogy is that English is spoken in the UK, US and Australia. While the language is generally the same, there are subtle differences in word usage which can lead to confusion. This exists in storage subsystems as well. As these subsystems become more capable, faster and more complex, these differences in implementation can have greater impact. Benefits of Dell's Hard Disk and SSD drives are outlined in a white paper on Dell's web site at http://www.dell.com/downloads/global/products/pvaul/en/dell-hard-drives-pov.pdf"

I understand they won't support 3rd-party disk drives, but blocking a server (a RAID card) from using such disks is something new - see an interesting comment here.

IBM 2010: Customers in Revolt

From my own experience their sales people are very aggressive, with an attitude of "sell first and let someone else worry later". While I always take any vendor's claims with a grain of salt, I have learnt to double or even triple check IBM's claims.

Tuesday, February 09, 2010

Power 7

Now let's wait for some benchmarks. I only wish Solaris were running on them as well - right now you need to go the legacy AIX route or the not-so-mature Linux route, which is not an ideal choice.

Thursday, February 04, 2010

Data Corruption - ZFS saves the day, again

We came across an interesting issue with data corruption and I think it might be interesting to some of you. While preparing a new cluster deployment and filling it up with data, we suddenly started to see the below messages:

XXX cl_runtime: [ID 856360 kern.warning] WARNING: QUORUM_GENERIC: quorum_read_keys error:
Reading the registration keys failed on quorum device /dev/did/rdsk/d7s2 with error 22.

The d7 quorum device was marked as being offline and we could not bring it online again. There isn't much in the documentation about the above message, except that it is probably a firmware problem on a disk array and we should contact the vendor. But let's first investigate what is really going on.

By looking at the source code I found that the above message is printed from within quorum_device_generic_impl::quorum_read_keys(), and it will only happen if quorum_pgre_key_read() returns with return code 22 (actually with any code other than 0 or EACCES, but from the syslog message we already suspect that the return code is 22).

The quorum_pgre_key_read() calls quorum_scsi_sector_read() and passes its return code as its own. The quorum_scsi_sector_read() will return with an error only if quorum_ioctl_with_retries() returns with an error or if there is a checksum mismatch.

This is the relevant source code:

406 int
407 quorum_scsi_sector_read(
[...]
449 error = quorum_ioctl_with_retries(vnode_ptr, USCSICMD, (intptr_t)&ucmd,
450 &retval);
451 if (error != 0) {
452 CMM_TRACE(("quorum_scsi_sector_read: ioctl USCSICMD "
453 "returned error (%d).\n", error));
454 kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
455 return (error);
456 }
457
458 //
459 // Calculate and compare the checksum if check_data is true.
460 // Also, validate the pgres_id string at the beg of the sector.
461 //
462 if (check_data) {
463 PGRE_CALCCHKSUM(chksum, sector, iptr);
464
465 // Compare the checksum.
466 if (PGRE_GETCHKSUM(sector) != chksum) {
467 CMM_TRACE(("quorum_scsi_sector_read: "
468 "checksum mismatch.\n"));
469 kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
470 return (EINVAL);
471 }
472
473 //
474 // Validate the PGRE string at the beg of the sector.
475 // It should contain PGRE_ID_LEAD_STRING[1|2].
476 //
477 if ((os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING1,
478 strlen(PGRE_ID_LEAD_STRING1)) != 0) &&
479 (os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING2,
480 strlen(PGRE_ID_LEAD_STRING2)) != 0)) {
481 CMM_TRACE(("quorum_scsi_sector_read: pgre id "
482 "mismatch. The sector id is %s.\n",
483 sector->pgres_id));
484 kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
485 return (EINVAL);
486 }
487
488 }
489 kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
490
491 return (error);
492 }

With a simple DTrace script I could verify whether quorum_scsi_sector_read() does indeed return with 22, and also print what else is going on within the function:

56 -> __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555744942019 enter
56 -> __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555744957176 enter
56 <- __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555745089857 rc: 0
56 -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745108310 enter
56 -> __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745120941 enter
56 -> __1cCosHsprintf6FpcpkcE_v_ 6308555745134231 enter
56 <- __1cCosHsprintf6FpcpkcE_v_ 6308555745148729 rc: 2890607504684
56 <- __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745162898 rc: 1886718112
56 <- __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745175529 rc: 1886718112
56 <- __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555745188599 rc: 22

From the above output we know that quorum_ioctl_with_retries() returns with 0, so it must be a checksum mismatch! As CMM_TRACE() is being called above and there are only three of them in the code, let's check with DTrace which one it is:

21 -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6309628794339298 quorum_scsi_sector_read: checksum mismatch.

So now I knew exactly what part of the code was causing the quorum device to be marked offline. The issue might have been caused by many things: a bug in the disk array firmware, a problem on the SAN, a bug in an HBA's firmware, a bug in the qlc driver, a bug in the SC software, or... However, because the issue suggests data corruption, and we are loading the cluster with a copy of a database, we might have a bigger issue than just an offline quorum device. The configuration is such that we are using ZFS to mirror between two disk arrays. We have been restoring a couple of TBs of data into the pool and have read almost nothing back. Thankfully it is ZFS, so we could force a re-check of all data in the pool - and I did. ZFS found 14 corrupted blocks and even identified which file is affected. The interesting thing here is that for all of these blocks both copies on both sides of the mirror were affected. This almost eliminates the possibility of a firmware problem on the disk arrays and suggests that the issue was caused by something misbehaving on the host itself. There is still a possibility of an issue on the SAN as well. It is very unlikely to be a bug in ZFS, as the corruption affected the reservation keys as well, which have basically nothing to do with ZFS at all. We are still writing more and more data into the pool, I'm repeating scrubs, and I'm not getting any new corrupted blocks, nor is the quorum misbehaving (I fixed it by temporarily adding another quorum device, removing the original and re-adding it again, then removing the temporary one).

While I still have to find what caused the data corruption, the most important thing here is ZFS. Just think about it - what would have happened if we were running on any other file system like UFS, VxFS, ext3, ext4, JFS, XFS, ...? Almost anything: some data could have been silently corrupted, some files lost, the system could have crashed, fsck could have been forced to run for many hours and still not been able to fix the filesystem (and it definitely wouldn't be able to detect data corruption within files), or everything could have been running fine for days or months until the system suddenly panicked when an application tried to access the corrupted blocks for the first time. Thanks to ZFS, what has actually happened? All corrupted blocks were identified; unfortunately both mirrored copies were affected so ZFS can't fix them, but it did identify the single file which was affected by all these blocks. We can just remove the file, which is only 2GB, and restore it again. And all of this while the system was running - we haven't even stopped the restore or had to start from the beginning. Most importantly, there is no uncertainty about the state of the filesystem or the data within it.

The other important conclusion is that DTrace is a sysadmin's best friend :)


Thursday, January 21, 2010

The EC gives a green light

The European Commission clears Oracle's proposed acquisition of Sun Microsystems:
"The European Commission has approved under the EU Merger Regulation the proposed acquisition of US hardware and software vendor Sun Microsystems Inc. by Oracle Corporation, a US enterprise software company. After an in-depth examination, launched in September 2009 (see IP/09/1271 ), the Commission concluded that the transaction would not significantly impede effective competition in the European Economic Area (EEA) or any substantial part of it."

Friday, January 15, 2010

MySQL TOP

This blog entry has been updated:
  • added NCQRS, NCQRS/s columns
  • fixed issue with dtrace dropping variables if the script was running for extended time periods
  • cosmetic changes re output

I need to observe MySQL load from time to time and DTrace is one of the tools I use. Usually I'm using one-liners or I come up with a short script. This time I thought it would be nice to write a script that other people, like DBAs, could use without having to understand how it actually works. The script prints basic statistics for each client connecting to a database. It gives a nice overview of all clients using a database.

CLIENT IP CONN CONN/s QRS QRS/s NCQRS NCQRS/s TIME VTIME
10.10.10.35 10 0 61 0 32 0 0 0
10.10.10.30 17 0 73 0 73 0 0 0
10.10.10.100 52 0 90 0 90 0 0 0
xx-www-11.portal 92 0 249 0 48 0 0 0
xx-cms-1.portal 95 0 1795 5 1669 4 48 48
xx-www-9.portal 198 0 634 1 278 0 0 0
xx-www-13.portal 239 0 986 2 366 1 0 0
xx-www-3.portal 266 0 1028 2 455 1 1 0
xx-www-12.portal 266 0 1070 2 561 1 3 2
xx-www-5.portal 300 0 1431 3 593 1 2 2
xx-www-10.portal 333 0 1221 3 678 1 3 2
xx-www-6.portal 334 0 989 2 446 1 1 0
xx-www-8.portal 358 0 1271 3 497 1 1 0
xx-www-4.portal 395 1 1544 4 744 2 0 0
xx-www-2.portal 445 1 1729 4 764 2 3 2
xx-www-1.portal 962 2 3555 9 1670 4 22 21
xx-www-7.portal 1016 2 3107 8 1643 4 117 115
====== ===== ====== ===== ====== ===== ===== =====
5378 14 20833 58 10607 29 207 199
Running for 359 seconds.

CONN total number of connections
CONN/s average number of connections per second
QRS total number of queries
QRS/s average number of queries per second
NCQRS total number of executed queries
NCQRS/s average number of executed queries per second
TIME total clock time in seconds for all queries
VTIME total CPU time in seconds for all queries

The NCQRS column represents the number of queries which were not served from the MySQL Query Cache, while QRS represents all queries issued to MySQL (cached, non-cached or even non-valid queries). If values of VTIME are very close to values of TIME, it means that queries are mostly CPU bound. On the other hand, the bigger the difference between them, the more time is spent on I/O. Another interesting thing to watch is how evenly load is coming from different clients, especially in environments where clients are identical www servers behind a load balancer and should be generating about the same traffic to the database.

All values are measured since the script was started. There might be some discrepancies with totals in the summary line - this is due to rounding errors. The script should work for MySQL versions 5.0.x, 5.1.x and perhaps for other versions as well. The script doesn't take into account connections made over a socket file - only tcp/ip connections.

The script requires the PID of a mysql database as its first argument and the frequency at which the output should be refreshed as its second argument. For example, to monitor a mysql instance with PID 12345 and refresh the output every 10s:

./mysql_top.d 12345 10s



# cat mysql_top.d
#!/usr/sbin/dtrace -qCs

/*
Robert Milkowski
*/

#define CLIENTS self->client_ip == "10.10.10.11" ? "xx-www-1.portal" : \
self->client_ip == "10.10.10.12" ? "xx-www-2.portal" : \
self->client_ip == "10.10.10.13" ? "xx-www-3.portal" : \
self->client_ip == "10.10.10.14" ? "xx-www-4.portal" : \
self->client_ip == "10.10.10.15" ? "xx-www-5.portal" : \
self->client_ip == "10.10.10.16" ? "xx-www-6.portal" : \
self->client_ip == "10.10.10.17" ? "xx-www-7.portal" : \
self->client_ip == "10.10.10.18" ? "xx-www-8.portal" : \
self->client_ip == "10.10.10.19" ? "xx-www-9.portal" : \
self->client_ip == "10.10.10.20" ? "xx-www-10.portal" : \
self->client_ip == "10.10.10.21" ? "xx-www-11.portal" : \
self->client_ip == "10.10.10.22" ? "xx-www-12.portal" : \
self->client_ip == "10.10.10.23" ? "xx-www-13.portal" : \
self->client_ip == "10.10.10.29" ? "xx-cms-1.portal" : \
self->client_ip

#define LEGEND "\n \
CONN total number of connections \n \
CONN/s average number of connections per second \n \
QRS total number of queries \n \
QRS/s average number of queries per second \n \
NCQRS total number of executed queries \n \
NCQRS/s average number of executed queries per second \n \
TIME total clock time in seconds for all queries \n \
VTIME total CPU time in seconds for all queries\n"

BEGIN
{
start = timestamp;
total_queries = 0;
total_nc_queries = 0;
total_conn = 0;
total_time = 0;
total_vtime = 0;
}

syscall::getpeername:entry
/ pid == $1 /
{
self->in = 1;

self->arg0 = arg0; /* int s */
self->arg1 = arg1; /* struct sockaddr * */
self->arg2 = arg2; /* size_t len */
}

syscall::getpeername:return
/ self->in /
{
this->len = *(socklen_t *) copyin((uintptr_t)self->arg2, sizeof(socklen_t));
this->socks = (struct sockaddr *) copyin((uintptr_t)self->arg1, this->len);
this->hport = (uint_t)(this->socks->sa_data[0]);
this->lport = (uint_t)(this->socks->sa_data[1]);
this->hport <<= 8; this->port = this->hport + this->lport;

this->a1 = lltostr((uint_t)this->socks->sa_data[2]);
this->a2 = lltostr((uint_t)this->socks->sa_data[3]);
this->a3 = lltostr((uint_t)this->socks->sa_data[4]);
this->a4 = lltostr((uint_t)this->socks->sa_data[5]);
this->s1 = strjoin(this->a1, ".");
this->s2 = strjoin(this->s1, this->a2);
this->s1 = strjoin(this->s2, ".");
this->s2 = strjoin(this->s1, this->a3);
this->s1 = strjoin(this->s2, ".");
self->client_ip = strjoin(this->s1, this->a4);

@conn[CLIENTS] = count();
@conn_ps[CLIENTS] = count();

total_conn++;

self->arg0 = 0;
self->arg1 = 0;
self->arg2 = 0;
}

pid$1::*close_connection*:entry
/ self->in /
{
self->in = 0;
self->client_ip = 0;
}

pid$1::*mysql_parse*:entry
/ self->in /
{
self->t = timestamp;
self->vt = vtimestamp;

@query[CLIENTS] = count();
@query_ps[CLIENTS] = count();

total_queries++;
}

pid$1::*mysql_parse*:return
/ self->in /
{
@time[CLIENTS] = sum(timestamp-self->t);
@vtime[CLIENTS] = sum(vtimestamp-self->vt);

total_time += (timestamp - self->t);
total_vtime += (vtimestamp - self->vt);

self->t = 0;
self->vt = 0;
}

pid$1::*mysql_execute_command*:entry
/ self-> in /
{
@nc_query[CLIENTS] = count();
@nc_query_ps[CLIENTS] = count();

total_nc_queries++;
}

tick-$2
{
/* clear the screen and move cursor to top left corner */
printf("\033[H\033[J");

this->seconds = (timestamp - start) / 1000000000;

normalize(@conn_ps, this->seconds);
normalize(@query_ps, this->seconds);
normalize(@nc_query_ps, this->seconds);
normalize(@time, 1000000000);
normalize(@vtime, 1000000000);

printf("%-16s %s %s %s %s %s %s %s %s\n", \
"CLIENT IP", "CONN", "CONN/s", "QRS", "QRS/s", "NCQRS", "NCQRS/s", "TIME", "VTIME");
printa("%-16s %@6d %@5d %@6d %@5d %@6d %@5d %@5d %@5d\n", \
@conn, @conn_ps, @query, @query_ps, @nc_query, @nc_query_ps, @time, @vtime);
printf("%-16s %s %s %s %s %s %s %s %s\n", \
"", "======", "=====", "======", "=====", "======", "=====", "=====", "=====");
printf("%-16s %6d %5d %6d %5d %6d %5d %5d %5d\n", "", \
total_conn, total_conn/this->seconds, total_queries, total_queries/this->seconds, \
total_nc_queries, total_nc_queries/this->seconds, \
total_time/1000000000, total_vtime/1000000000);

/*
denormalize(@conn_ps);
denormalize(@query_ps);
denormalize(@nc_query_ps);
denormalize(@total_time);
denormalize(@total_vtime);
*/

printf("Running for %d seconds.\n", this->seconds);

printf(LEGEND);
}

Thursday, January 14, 2010

MySQL Query Time Histogram

When doing MySQL performance tuning on a live server it is often hard to tell what impact a change will have on all queries - sometimes by increasing one of the MySQL caches you can make some queries execute faster while others actually get slower. However, depending on your environment, that might not necessarily be a bad thing. For example, in web serving, if most queries execute within 0.1s but some odd queries need 5s to complete, it is generally very bad, as a user would need to wait at least 5s to get a web page. Now, if by some tuning you manage to get these long queries down to below 1s at the cost of some sub-0.1s queries taking more time - but still less than 1s - it would generally be a very good thing to do. Of course in other environments the time requirements might be different, but the principle is the same.

Now it is actually very easy to get such a distribution of the number of queries executed by a given MySQL instance within a given time slot if you use DTrace.

1s resolution
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4700573
1 | 6366
2 | 35
3 | 23
4 | 39
5 | 8
6 | 6
7 | 5
8 | 7
9 | 4
>= 10 | 9

Running for 73344 seconds.

The above histogram shows that 4.7 million queries executed in under 1s each, another 6366 queries took between 1s and 2s each, and so on. Now let's do some tuning and see the results again (of course you want to measure for a similar amount of time during a similar period of activity - these are just examples):

1s resolution
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4686051
1 | 2972
2 | 0

Running for 73024 seconds.

That is much better. It is of course very easy to change the resolution of the histogram - but I will leave it for you.

The script requires 2 arguments - the PID of a database and how often it should refresh its output. For example, in order to get output every 10s for a database running with PID 12345, run the script as:

./mysql_query_time_distribution.d 12345 10s

The script doesn't distinguish between cached and non-cached queries, nor does it detect bad (wrong syntax) queries - however it is relatively easy to extend it to do so (maybe another blog entry one day). It should work fine with all MySQL versions 5.0.x and 5.1.x, and possibly with other versions as well.


# cat mysql_query_time_distribution.d
#!/usr/sbin/dtrace -qs


BEGIN
{
start=timestamp;
}

pid$1::*mysql_parse*:entry
{
self->t=timestamp;
}

pid$1::*mysql_parse*:return
/ self->t /
{
@["1s resolution"]=lquantize((timestamp-self->t)/1000000000,0,10);

self->t=0;
}

tick-$2
{
printa(@);
printf("Running for %d seconds.\n", (timestamp-start)/1000000000);
}

Thursday, January 07, 2010

Identifying a Full Table Scan in MySQL with Dtrace

Yesterday I was looking at some performance issues with a mysql database. The database is version 5.1.x, so there are no built-in DTrace SDT probes, but much can still be done even without them. What I quickly noticed is that mysql was issuing several hundred thousand syscalls per second, most of them pread()s and read()s. The databases are using the MyISAM engine, so mysql does not have a data buffer cache and leaves all the caching to the filesystem. I was interested in how many reads were performed per given query, so I wrote a small dtrace script. The script takes as arguments a time after which it will exit and a threshold representing the minimum number of [p]read()s per query for a query to be printed.

So let's see an example output where we are interested only in queries which cause at least 10000 reads to be issued:
# ./m2.d 60s 10000
### read() count: 64076 ###
SELECT * FROM clip WHERE parent_id=20967 AND type=4 ORDER BY quality ASC

### read() count: 64076 ###
SELECT * FROM clip WHERE parent_id=14319 AND type=4 ORDER BY quality ASC

### read() count: 64076 ###
SELECT * FROM clip WHERE parent_id=20968 AND type=4 ORDER BY quality ASC

There are about 60k entries for that parent_id column, which suggests that mysql is doing a full table scan when executing the above queries. A quick check within mysql revealed that there was no index for the parent_id column, so mysql was indeed doing full table scans. After the index was created:

# ./m2.d 60s 1
[filtered out all unrelated queries]
### read() count: 6 ###
SELECT * FROM clip WHERE parent_id=22220 AND type=4 ORDER BY quality ASC

### read() count: 8 ###
SELECT * FROM clip WHERE parent_id=8264 AND type=4 ORDER BY quality ASC

### read() count: 4 ###
SELECT * FROM clip WHERE parent_id=21686 AND type=4 ORDER BY quality ASC

### read() count: 4 ###
SELECT * FROM clip WHERE parent_id=21687 AND type=4 ORDER BY quality ASC

So now each query issues roughly four orders of magnitude fewer read()s!

Granted, all these reads were satisfied from the ZFS ARC cache, but it still saves hundreds of thousands of unnecessary context switches and memory copies, making the queries *much* quicker to execute and saving valuable CPU cycles. The real issue I was working on was a little bit more complicated, but you get the idea.

The point I'm trying to make here is that although MySQL lacks good tools to analyze its workload, you have a very powerful tool called dtrace which allows you to relatively quickly identify which queries are causing an issue and why - and all of that on a live running service, without having to reconfigure or restart mysql. I know there is the MySQL Query Analyzer (or whatever it is called), but it requires a mysql proxy to be deployed... In this case it was much quicker and easier to use dtrace.

Below you will find the script. Please note that I hard-coded the PID of the database and the script could be cleaned up, etc. - it is the working copy I used. The script can easily be modified to provide lots of additional useful information, or it can be limited to only a specific myisam file, etc.

# cat m2.d
#!/usr/sbin/dtrace -qs

#pragma D option strsize=8192


pid13550::*mysql_parse*:entry
{
self->a=1;
self->query=copyinstr(arg1);
self->count=0;

}

pid13550::*mysql_parse*:return
/ self->a && self->count > $2 /
{
printf("### read() count: %d ###\n%s\n\n", self->count, self->query);

self->a=0;
self->query=0;

}

pid13550::*mysql_parse*:return
/ self->a /
{
self->a=0;
self->query=0;
}

syscall::*read*:entry
/ self->a /
{
self->count++;
}

tick-$1
{
exit(0);
}

Tuesday, January 05, 2010

zpool split

PSARC/2009/511 zpool split:
OVERVIEW:

Some practices in data centers are built around the use of a volume
manager's ability to clone data. An administrator will attach a set of
disks to mirror an existing configuration, wait for the resilver to
complete, and then physically detach and remove those disks to a new
location.

Currently in zfs, the only way to achieve this is by using zpool offline
to disable a set of disks, zpool detach to permanently remove them after
they've been offlined, move the disks over to a new host, zpool
force-import of the moved disks, and then zpool detach the disks that were
left behind.

This is cumbersome and prone to error, and even then the new pool
cannot be imported on the same host as the original.

PROPOSED SOLUTION:

Introduce a "zpool split" command. This will allow an administrator to
extract one disk from each mirrored top-level vdev and use them to create
a new pool with an exact copy of the data. The new pool can then be
imported on any machine that supports that pool's version.
The new feature should be available in build 131.
See implementation details.