Tuesday, February 27, 2007

Data corruption on SATA array

Below is today's Sun Alert - another good example that, even with currently shipping arrays, data corruption still happens in array firmware, and that ZFS is the only file system able to detect such corruption and correct it.

The alert explains some of the issues here...


Sun Alert ID: 102815
Synopsis: SE3310/SE3320/SE3510/SE3511 Storage Arrays May Experience Data Integrity Events
Product: Sun StorageTek 3510 FC Array, Sun StorEdge 3310 NAS Array, Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
Category: Data Loss, Availability
Date Released: 22-Feb-2007

To view this Sun Alert document please go to the following URL:
http://sunsolve.sun.com/search/document.do?assetkey=1-26-102815-1

Sun(sm) Alert Notification

* Sun Alert ID: 102815
* Synopsis: SE3310/SE3320/SE3510/SE3511 Storage Arrays May Experience Data Integrity Events
* Category: Data Loss, Availability
* Product: Sun StorageTek 3510 FC Array, Sun StorEdge 3310 NAS Array, Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
* BugIDs: 6511494
* Avoidance: Workaround
* State: Workaround
* Date Released: 22-Feb-2007
* Date Closed:
* Date Modified:

1. Impact

System panics and warning messages on the host Operating System may occur due to a filesystem reading and acting on incorrect data from the disk or a user application reading and acting on incorrect data from the array.
2. Contributing Factors

This issue can occur on the following platforms:

* Sun StorEdge 3310 (SCSI) Array with firmware version 4.11K/4.13B/4.15F (as delivered in patch 113722-10/113722-11/113722-15)
* Sun StorageTek 3320 (SCSI) Array with firmware version 4.15G (as delivered in patch 113730-01)
* Sun StorageTek 3510 (FC) Array with firmware version 4.11I/4.13C/4.15F (as delivered in patch 113723-10/113723-11/113723-15)
* Sun StorageTek 3511 (FC) Array with firmware version 4.11I/4.13C/4.15F (as delivered in patch 113724-04/113724-05/113724-09)

The above RAID arrays (single or dual controller) with "Write Behind Caching" enabled can return stale data when the I/O contains writes and reads in a very specific pattern; this affects RAID 5 LUNs, and LUNs at other RAID levels when an array disk administration action occurs. The pattern has so far only been observed in UFS metadata updates, but could be seen in other situations.
3. Symptoms

Filesystem warnings and panics occur with no indication of an underlying storage issue. For UFS these messages could include:

"panic: Freeing Free Frag"
WARNING: /: unexpected allocated inode XXXXXX, run fsck(1M) -o f
WARNING: /: unexpected free inode XXXXXX, run fsck(1M) -o f

This list is not exhaustive and other symptoms of stale data reads might be seen.

Solution Summary
4. Relief/Workaround

Disable the "write behind" caching option inside the array using your preferred array administration tool (sccli(1M) or telnet). This workaround can be removed on final resolution.

Use ZFS to detect (and correct if configured) the Data Integrity Events.

If not using a filesystem make sure your application has checksums and identity information embedded in its disk data so it can detect Data Integrity Events.

Migrating back to 3.X firmware is a major task and is not recommended.

5. Resolution

A final resolution is pending completion.
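
To illustrate the second workaround from the alert: a minimal sketch, assuming a hypothetical mirrored pool built on two array LUNs (the pool and device names are just placeholders), of how ZFS surfaces and repairs this kind of silent corruption:

# create a mirrored pool so ZFS has a redundant copy to repair from
zpool create tank mirror c2t0d0 c2t1d0
# read every block back and verify it against its checksum
zpool scrub tank
# blocks that failed verification (and were repaired from the mirror)
# show up in the CKSUM column and in the scrub summary line
zpool status -v tank

With a single, non-redundant LUN ZFS can still detect the stale data and return an error instead of silently passing it on, but it needs redundancy (a mirror or RAID-Z) to repair it.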

Thursday, February 22, 2007

ldapsearch story

Recently we migrated a system with a lot of small scripts, programs, etc. from Linux to Solaris. The problem with such systems, running many different, relatively small scripts and programs written by different people over a long period of time, is that it's really hard to tell which of those programs eat up the most CPU, which of them generate the most I/Os, and so on. Ok, on Linux it is hard to almost impossible to tell - more guessing than real data.

But once we migrated to Solaris we quickly tried DTrace. And the result was really surprising - of all those things, the application (by name) which eats the most CPU is the ldapsearch utility. It's been used by some scripts, but no one expected it to be the top application. As many of those scripts are written in Perl we tried using the Perl LDAP module instead, and once we did that for some scripts their CPU usage dropped considerably, down to somewhere in the noise of all the other applications.

How hard was it to reach that conclusion?

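# sum the time (in nanoseconds) each program name spends on-CPU; Ctrl-C prints the totals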
dtrace -n sched:::on-cpu'{self->t=timestamp;}' \
-n sched:::off-cpu'/self->t/{@[execname]=sum(timestamp-self->t);self->t=0;}'
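
A related one-liner (a sketch, not something from the original investigation) counts how often each program gets exec'd, which makes it obvious how frequently the scripts keep spawning ldapsearch:

# count successful exec()s per program name; Ctrl-C prints the totals
dtrace -n 'proc:::exec-success{@[execname]=count();}'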


Another interesting thing is that when we use the built-in ZFS compression (lzjb) we get about a 3.6x compression ratio for all our collected data - yes, that means the data take up less than a third of the space on disk, without changing any applications.
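
For reference, turning compression on and checking the result is just two commands - a sketch with a placeholder dataset name:

# enable lzjb compression; only data written from now on gets compressed
zfs set compression=lzjb tank/data
# check the achieved ratio once data has been (re)written
zfs get compressratio tank/data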

Disk failures in data center

A really interesting paper from Google about disk failures in a large disk population (over 100,000 drives). It also looks at whether SMART data can be reliably used to predict disk failures.

There's also another paper presented by people from Carnegie Mellon University.

Some observations are really surprising - for example, a disk running below 20°C has a higher probability of failing than one running at about 40°C. Another interesting thing is that SATA disks seem to have an annual replacement rate (ARR) similar to FC and SCSI disks.

Anyone managing "disks" should read those two papers.

Update: NetApp's response to the above papers.
There is also an interesting paper from Seagate and Microsoft on SATA disks.

This time IBM responded.