When building a storage appliance based on ZFS, one important feature is the ability to identify physical disk locations, which is harder to do for SAS disks than for SATA. Solaris 11 has a topology framework which makes this much easier, and it is nicely integrated with subsystems such as FMA and ZFS. Recently I came across a blog entry which highlights this specific issue of how to identify physical disk locations.
The other important factor is ease of use - in order to replace a failed disk drive one should be able to pull out the bad one, put in a replacement, and that's it - all the rest should be done automatically. There shouldn't be any need to log in to the OS and issue commands to assist with the replacement. Again, this is how things are in Solaris 11.
Let's see how it works in practice. Recently one disk reported two read errors and multiple checksum errors during a zpool scrub. Because the affected pool is redundant (RAID-10), ZFS was able to detect the corruption, serve good data from the other disk in the mirror and fix the corrupted blocks on the affected disk.
The story doesn't end here though - FMA decided that too many checksum errors were reported for a single disk, so it activated a hot spare as a precaution, in case the bad disk misbehaves again. This is what the pool looked like after the hot spare was fully attached:
# zpool status -v pool-0
pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or 'fmadm repaired', or replace the device
with 'zpool replace'.
scan: resilvered 458G in 2h23m with 0 errors on Sat Nov 2 07:53:42 2013
config:
NAME STATE READ WRITE CKSUM
pool-0 DEGRADED 0 0 0
mirror-0 ONLINE 0 0 0
c0t5000CCA0165FC0F8d0 ONLINE 0 0 0
c0t5000CCA016217040d0 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
c0t5000CCA01666AB64d0 ONLINE 0 0 0
c0t5000CCA0166F3BB8d0 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
c0t5000CCA0166F36C8d0 ONLINE 0 0 0
c0t5000CCA01661894Cd0 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
c0t5000CCA0166BE338d0 ONLINE 0 0 0
c0t5000CCA016626340d0 ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
c0t5000CCA0166DC81Cd0 ONLINE 0 0 0
c0t5000CCA016685238d0 ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
c0t5000CCA016636CA4d0 ONLINE 0 0 0
c0t5000CCA016687528d0 ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
c0t5000CCA0166DC944d0 ONLINE 0 0 0
c0t5000CCA0166DC0CCd0 ONLINE 0 0 0
mirror-7 ONLINE 0 0 0
c0t5000CCA0166F4178d0 ONLINE 0 0 0
c0t5000CCA01668DCC0d0 ONLINE 0 0 0
mirror-8 DEGRADED 0 0 0
c0t5000CCA0166DD7BCd0 ONLINE 0 0 0
spare-1 DEGRADED 0 0 0
c0t5000CCA016671600d0 DEGRADED 2 0 49
c0t5000CCA0166876F8d0 ONLINE 0 0 0
mirror-9 ONLINE 0 0 0
c0t5000CCA0166DC20Cd0 ONLINE 0 0 0
c0t5000CCA0166877BCd0 ONLINE 0 0 0
mirror-10 ONLINE 0 0 0
c0t5000CCA0166F3334d0 ONLINE 0 0 0
c0t5000CCA0166BDD2Cd0 ONLINE 0 0 0
spares
c0t5000CCA0166876F8d0 INUSE
c0t5000CCA0166DCAACd0 AVAIL
device details:
c0t5000CCA016671600d0 DEGRADED too many errors
status: FMA has degraded this device.
action: Run 'fmadm faulty' for more information. Clear the errors using 'fmadm repaired'.
see: http://support.oracle.com/msg/DISK-8000-D5 for recovery
errors: No known data errors
Notice that the mirror-8 vdev is a 3-way mirror now - the affected disk is still functional, so it wasn't detached, but in case it fails completely the hot spare has been attached alongside it.
Below is what FMA reported:
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 02 05:29:39 d10c88f7-8e31-ce12-ab5c-8a759cf875c3 DISK-8000-D5 Major
Problem Status : solved
Diag Engine : eft / 1.16
System
Manufacturer : Oracle-Corporation
Name : SUN-FIRE-X4270-M3
Part_Number : 31792382+1+1
Serial_Number : XXXXXXXX
Host_ID : 004858f6
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.io.scsi.disk.csum-zfs.transient
Certainty : 100%
Affects : dev:///:devid=id1,sd@n5000cca016671600//scsi_vhci/disk@g5000cca016671600
Status : faulted but still providing degraded service
FRU
Location : "HDD17"
Manufacturer : HITACHI
Name : H109090SESUN900G
Part_Number : HITACHI-H109090SESUN900G
Revision : A31A
Serial_Number : XXXXXXXX
Chassis
Manufacturer : Oracle-Corporation
Name : SUN-FIRE-X4270-M3
Part_Number : 31792382+1+1
Serial_Number : XXXXXXXX
Status : faulty
Description : There have been excessive transient ZFS checksum errors on this
disk.
Response : A hot-spare disk may have been activated.
Impact : If a hot spare is available it will be brought online and during
this time I/O could be impacted. If a hot spare isn't available
then I/O could be lost and data corruption is possible.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Please refer to the associated reference document at
http://support.oracle.com/msg/DISK-8000-D5 for the latest service
procedures and policies regarding this diagnosis.
Notice that FMA reports the affected disk location as HDD17 - this corresponds to the HDD17 slot on the X3-2L server. That way we know exactly which disk to replace.
We can also get physical disk locations from the zpool status command:
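When the fault report is long, it can help to pull out just the bay label. Below is a minimal sketch, assuming the Location line has the same shape as in the report above (a label in double quotes); the function name fru_locations is made up for illustration and is not part of Solaris.

```shell
# Print the quoted FRU location label from `fmadm faulty` output.
# Reads the report on stdin, so it can be tried on a saved copy first.
# Assumption: the label appears as  Location : "HDD17"  as in the report above.
fru_locations() {
    awk -F'"' '/Location *:/ { print $2 }'
}

# Example use on a live system (not run here):
#   fmadm faulty | fru_locations
```

For the report above this should print HDD17.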
# zpool status -l pool-0
pool: pool-0
state: DEGRADED
status: One or more devices has been diagnosed as degraded. An attempt
was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or 'fmadm repaired', or replace the device
with 'zpool replace'.
Run 'zpool status -v' to see device specific details.
scan: resilvered 458G in 2h23m with 0 errors on Sat Nov 2 07:53:42 2013
config:
NAME STATE READ WRITE CKSUM
pool-0 DEGRADED 0 0 0
mirror-0 ONLINE 0 0 0
/dev/chassis/SYS/HDD00/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD01/disk ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/dev/chassis/SYS/HDD02/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD03/disk ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
/dev/chassis/SYS/HDD04/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD05/disk ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
/dev/chassis/SYS/HDD06/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD07/disk ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
/dev/chassis/SYS/HDD08/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD09/disk ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
/dev/chassis/SYS/HDD10/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD11/disk ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
/dev/chassis/SYS/HDD12/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD13/disk ONLINE 0 0 0
mirror-7 ONLINE 0 0 0
/dev/chassis/SYS/HDD14/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD15/disk ONLINE 0 0 0
mirror-8 DEGRADED 0 0 0
/dev/chassis/SYS/HDD16/disk ONLINE 0 0 0
spare-1 DEGRADED 0 0 0
/dev/chassis/SYS/HDD17/disk DEGRADED 2 0 49
/dev/chassis/SYS/HDD22/disk ONLINE 0 0 0
mirror-9 ONLINE 0 0 0
/dev/chassis/SYS/HDD18/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD19/disk ONLINE 0 0 0
mirror-10 ONLINE 0 0 0
/dev/chassis/SYS/HDD20/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD21/disk ONLINE 0 0 0
spares
/dev/chassis/SYS/HDD22/disk INUSE
/dev/chassis/SYS/HDD23/disk AVAIL
errors: No known data errors
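With 24 bays the listing is long, so a small filter can show only the bays that need attention. This is a sketch, assuming device lines start with /dev/chassis/ and carry the state in the second column, as in the output above; the function name unhealthy_bays is hypothetical.

```shell
# List chassis-labelled devices whose state is not a healthy one.
# Reads `zpool status -l` output on stdin.
# Assumption: ONLINE, AVAIL and INUSE are the only states treated as healthy here.
unhealthy_bays() {
    awk '$1 ~ /^\/dev\/chassis\// &&
         $2 != "ONLINE" && $2 != "AVAIL" && $2 != "INUSE" { print $1, $2 }'
}

# Example use on a live system (not run here):
#   zpool status -l pool-0 | unhealthy_bays
```

For the pool above this leaves a single line, the HDD17 device in the DEGRADED state.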
Now it is up to us: we could run a scrub, wait a few days and, if there are no new errors, clear the pool status and deactivate the hot spare; or, if we don't want to take any chances, replace the affected disk drive. We decided to replace it. The disk in bay 17 of the X3-2L was physically pulled out and a replacement was put in its place. Since we have the autoreplace property enabled on the pool, FMA/ZFS automatically put an EFI label on the new disk and attached it to the pool; once it had fully synchronized, the hot spare was detached and made available again. We didn't have to log in to the OS or co-ordinate in any way with the physical disk replacement.
Here is what zpool history looked like:
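The hands-off replacement depends on the pool's autoreplace property, which zpool reports as on or off. The sketch below reads the value out of `zpool get autoreplace` output; it assumes the usual four-column NAME PROPERTY VALUE SOURCE layout, and the function name autoreplace_value is made up for illustration.

```shell
# Print the value of the autoreplace property from `zpool get` output.
# Reads the output on stdin.
# Assumption: columns are NAME PROPERTY VALUE SOURCE.
autoreplace_value() {
    awk '$2 == "autoreplace" { print $3 }'
}

# Example use on a live system (not run here):
#   zpool get autoreplace pool-0 | autoreplace_value
```

If this prints off, the property can be enabled with `zpool set autoreplace=on pool-0`.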
# zpool history -i pool-0
…
2013-11-06.20:38:03 [internal pool scrub txg:1709498] func=2 mintxg=3 maxtxg=1709499 logs=0
2013-11-06.20:38:17 [internal vdev attach txg:1709501] replace vdev=/dev/dsk/c0t5000CCA016217834d0s0 \
for vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:33 [internal pool scrub done txg:1710852] complete=1 logs=0
2013-11-06.23:01:34 [internal vdev detach txg:1710854] vdev=/dev/dsk/c0t5000CCA016671600d0s0
2013-11-06.23:01:39 [internal vdev detach txg:1710855] vdev=/dev/dsk/c0t5000CCA0166876F8d0s0
Let's see the pool status after the replacement disk fully synchronized:
# zpool status -l pool-0
pool: pool-0
state: ONLINE
scan: resilvered 459G in 2h23m with 0 errors on Wed Nov 6 23:01:34 2013
config:
NAME STATE READ WRITE CKSUM
pool-0 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/dev/chassis/SYS/HDD00/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD01/disk ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/dev/chassis/SYS/HDD02/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD03/disk ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
/dev/chassis/SYS/HDD04/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD05/disk ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
/dev/chassis/SYS/HDD06/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD07/disk ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
/dev/chassis/SYS/HDD08/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD09/disk ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
/dev/chassis/SYS/HDD10/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD11/disk ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
/dev/chassis/SYS/HDD12/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD13/disk ONLINE 0 0 0
mirror-7 ONLINE 0 0 0
/dev/chassis/SYS/HDD14/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD15/disk ONLINE 0 0 0
mirror-8 ONLINE 0 0 0
/dev/chassis/SYS/HDD16/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD17/disk ONLINE 0 0 0
mirror-9 ONLINE 0 0 0
/dev/chassis/SYS/HDD18/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD19/disk ONLINE 0 0 0
mirror-10 ONLINE 0 0 0
/dev/chassis/SYS/HDD20/disk ONLINE 0 0 0
/dev/chassis/SYS/HDD21/disk ONLINE 0 0 0
spares
/dev/chassis/SYS/HDD22/disk AVAIL
/dev/chassis/SYS/HDD23/disk AVAIL
errors: No known data errors
All is back to normal.
We can also check if all disks, including the new one, are of the same part number, firmware level, etc.
# diskinfo -t disk -o Rcmenf1
R:receptacle-name c:occupant-compdev m:occupant-mfg e:occupant-model n:occupant-part f:occupant-firm 1:occupant-capacity
----------------- --------------------- -------------- ---------------- ------------------------ --------------- -------------------
SYS/HDD00 c0t5000CCA0165FC0F8d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD01 c0t5000CCA016217040d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD02 c0t5000CCA01666AB64d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD03 c0t5000CCA0166F3BB8d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD04 c0t5000CCA0166F36C8d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD05 c0t5000CCA01661894Cd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD06 c0t5000CCA0166BE338d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD07 c0t5000CCA016626340d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD08 c0t5000CCA0166DC81Cd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD09 c0t5000CCA016685238d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD10 c0t5000CCA016636CA4d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD11 c0t5000CCA016687528d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD12 c0t5000CCA0166DC944d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD13 c0t5000CCA0166DC0CCd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD14 c0t5000CCA0166F4178d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD15 c0t5000CCA01668DCC0d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD16 c0t5000CCA0166DD7BCd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD17 c0t5000CCA016217834d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD18 c0t5000CCA0166DC20Cd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD19 c0t5000CCA0166877BCd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD20 c0t5000CCA0166F3334d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD21 c0t5000CCA0166BDD2Cd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD22 c0t5000CCA0166876F8d0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
SYS/HDD23 c0t5000CCA0166DCAACd0 HITACHI H109090SESUN900G HITACHI-H109090SESUN900G A31A 900185481216
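Rather than comparing 24 rows by eye, the part number and firmware columns can be tallied. This is a rough sketch, assuming the column order shown above (part number in the fifth field, firmware in the sixth), receptacle names beginning with SYS/ as on this server, and no embedded whitespace in any field; the function name disk_variants is hypothetical.

```shell
# Count how many disks share each (part number, firmware) combination.
# Reads `diskinfo -t disk -o Rcmenf1` output on stdin.
# Assumptions: data rows start with SYS/, part number is field 5, firmware is field 6.
disk_variants() {
    awk '$1 ~ /^SYS\// { count[$5 " " $6]++ }
         END { for (k in count) print count[k], k }'
}

# Example use on a live system (not run here):
#   diskinfo -t disk -o Rcmenf1 | disk_variants
```

A single output line means every disk matches; more than one line points at a disk with a different part number or firmware level.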
This is a very cool integration of different features in Solaris 11 which makes a ZFS-based solution much more reliable and easier to support.
The topology framework should work out of the box on Oracle servers running Solaris 11. The above example is from Solaris 11.1 + SRU10 running on an X3-2L server with 24 disks in front (and another two in the rear for the OS, mirrored by the controller itself). It has a simple HBA which presents all of the front disks as JBODs, which is perfect for ZFS.
The topology framework also works on 3rd party hardware, but depending on the particular set-up some additional configuration steps might be required, like defining Bay Labels. If the disks are behind a SAS expander then it is more complicated to get it working, and I'm not sure if there is a documented procedure describing how to do it.
Wouldn't the croinfo command provide you with a better view of the disk to physical topology map?
http://docs.oracle.com/cd/E23824_01/html/821-1462/croinfo-1m.html
diskinfo and croinfo are the same utility; depending on how you start it, different defaults are provided.