RAL Tier1 Incident 20081102 RAID5 double disk failure
Site: RAL-LCG2
Incident Date: 2008-11-02
Severity: Field not defined yet
Service: CASTOR
Impacted: ATLAS simStrip
Incident Summary: A double disk failure in the RAID5 array on gdss156 during a rebuild rendered the array inoperable. The first disk failed early on Saturday morning and the second early on Sunday morning. The second failure appears to have occurred before the rebuild finished, and the failed drive was a data disk, not a hot spare, hence the inoperable array.
Type of Impact: Data Loss
Incident duration:
Report date: 2008-11-07
Reported by: James Thorne, Tier1 Fabric Team
Related URLs: RAL Tier1 Incident 20081027, GGUS ticket 43111
Incident details:
| Date | Time | Who/What | Entry | 
|---|---|---|---|
| 2008-11-01 | 01:32:42 | syslog | Drive in port 3 fails early on a Saturday morning. |
| 2008-11-01 | 01:43:57 | Nagios | Nagios issues first alarm. |
| 2008-11-02 | 03:13:54 | syslog | Drive in port 13 fails early the following day and the array is no longer operable. |
| 2008-11-02 | 04:02:57 | Nagios | Nagios issues an alarm for fsprobe as it cannot write to the filesystem (second drive failure). |
| 2008-11-02 | 12:46:00 | Alessandro Di Girolamo (ATLAS) | Raised GGUS ticket 43111 | 
| 2008-11-02 | 19:21:55 | Catalin Condurache (on call) | After seeing the GGUS ticket, created a Tier1 helpdesk ticket requesting that the machine be taken out of CASTOR. | 
| 2008-11-02 | 20:45:14 | Chris Kruk | Removed gdss156 from production. | 
| 2008-11-03 | 14:18:00 | James Adams | Reported problem to Viglen, along with the output of Viglen's diagnostic tool. Waiting for Viglen's feedback. | 
Future mitigation:
For the measures taken regarding double disk failures, see the future mitigation section of RAL Tier1 Incident 20081027.
Related issues:
An erroneous message appeared in the logs in both recent double disk failures. In this incident, the messages file contained an obviously incorrect port number immediately after the failure of port 13, the last drive to fail:
Nov 2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=13.
Nov 2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483635.
In RAL Tier1 Incident 20081027, there is a similar message after the failure of port 5, again the last drive to fail:
Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=5.
Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483643.
This looks like an integer wraparound in the controller firmware. In both cases, subtracting the incorrect port number from the correct port number reported above gives exactly 2147483648 (2^31):
         13 - (-2147483635)  =  2147483648
          5 - (-2147483643)  =  2147483648
This has been reported to Viglen and 3ware.
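The arithmetic above is consistent with the top bit (bit 31) of a 32-bit port value being set and the result then being printed as a signed integer. The following is a minimal sketch of that hypothesis only; it is not taken from the 3ware firmware or driver, and the function names are made up for illustration.

```python
import struct


def as_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed two's-complement integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def port_with_high_bit(port: int) -> int:
    """Hypothetical: set bit 31 on the port number, then read it as signed 32-bit."""
    return as_signed32(port | 0x80000000)


def recover_port(logged: int) -> int:
    """Hypothetical: clear bit 31 to get back the real port number."""
    return logged & 0x7FFFFFFF


# Compare against the values seen in the two incidents' logs.
for real_port, logged in ((13, -2147483635), (5, -2147483643)):
    print(real_port, port_with_high_bit(real_port) == logged, recover_port(logged))
```

Under this assumption, both logged values map back to the ports named in the preceding "Drive timeout detected" messages.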
Timeline
| Event | Date | Time | Comment |
|---|---|---|---|
| Actually started | 2008-11-01 | 01:32:42 | First drive failed | 
| Fault first detected | 2008-11-02 | 04:02:57 | Nagios | 
| First Advisory Issued | 2008-11-03 | 14:00 | Gareth Smith reported the problem at the WLCG daily operations meeting. | 
| First Intervention | 2008-11-03 | 09:00:00 | James Adams takes a look and confirms data is unrecoverable. | 
| Fault Fixed | When was the problem resolved | ||
| Announced as Fixed | How, to who | ||
| Downtime(s) Logged in GOCDB | n/a | n/a | none | 
| Other Advisories Issued | 2008-11-03 | 12:24 | Gareth Smith emailed atlas-uk-comp-operations@cern.ch. | 
| Other Advisories Issued | 2008-11-03 | n/a | Brian Davies remained in contact with ATLAS. | 
