Wednesday, April 19, 2006

Predictive Self Healing

Yesterday a system, an E6500 running Solaris 10, reported memory errors (ECC corrected) several times on board 0, DIMM J3300.

Apr 18 17:45:30 server SUNW,UltraSPARC-II: [ID 197328] [AFT0] Corrected Memory Error detected by CPU8, errID 0x0002de3b.e98a88fa
Apr 18 17:45:30 server AFSR 0x00000000.00100000 AFAR 0x00000000.c709d428
Apr 18 17:45:30 server AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x11e7fe8
Apr 18 17:45:30 server UDBL Syndrome 0x64 Memory Module Board 0 J3300
Apr 18 17:45:30 server SUNW,UltraSPARC-II: [ID 429688] [AFT0] errID 0x0002de3b.e98a88fa Corrected Memory Error on Board 0 J3300 is Intermittent
Apr 18 17:45:30 server SUNW,UltraSPARC-II: [ID 671797] [AFT0] errID 0x0002de3b.e98a88fa ECC Data Bit 7 was in error and corrected

As these errors happened a few more times, the system finally removed a single page (8 kB) from system memory, so the problem will not escalate and possibly kill the system or an application. Well, I can live with 8 kB less memory - that's real Predictive Self Healing.

Apr 18 19:14:51 server SUNW,UltraSPARC-II: [ID 566906 kern.warning] WARNING: [AFT0] Most recent 3 soft errors from Memory Module Board 0 J3300 exceed threshold (N=2, T=24h:00m) triggering page retire
Apr 18 19:14:51 server unix: [ID 618185 kern.notice] NOTICE: Scheduling removal of page 0x00000000.c709c000
Apr 18 19:15:50 server unix: [ID 693633 kern.notice] NOTICE: Page 0x00000000.c709c000 removed from service
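
To make the numbers in these messages a bit more concrete, here is a minimal Python sketch (not the actual Solaris code). It shows two things: how the faulting address from the AFAR field (0x00000000.c709d428) maps to the retired page (0x00000000.c709c000) when pages are 8 kB, and one possible reading of the "exceed threshold (N=2, T=24h:00m)" rule, namely that a page is retired once more than N soft errors fall within a window of T. The names page_base and SoftErrorWindow are hypothetical, and the sliding-window interpretation is an assumption based only on the log text above.

```python
from collections import deque

PAGE_SIZE = 8 * 1024  # 8 kB page, as mentioned in the post


def page_base(addr, page_size=PAGE_SIZE):
    """Round a physical address down to the start of its page."""
    return addr & ~(page_size - 1)


class SoftErrorWindow:
    """Hypothetical sliding-window counter for corrected (soft) errors.

    Assumption: retirement triggers when MORE than n errors are seen
    within window_seconds. This is a guess from the log message,
    not the real kernel logic.
    """

    def __init__(self, n=2, window_seconds=24 * 3600):
        self.n = n
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, t):
        """Record an error at time t (seconds); return True if the threshold is exceeded."""
        self.timestamps.append(t)
        # Drop errors that are older than the window relative to the newest one.
        while self.timestamps and t - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.n


if __name__ == "__main__":
    afar = 0x00000000C709D428  # AFAR value from the log above
    print(hex(page_base(afar)))  # 0xc709c000, the page the system retired
```

Masking off the low 13 bits of the AFAR value gives exactly the page address reported in the "Scheduling removal of page" message.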

