RAL Tier1 Incident 20090805 Data loss following multiple disk failures
Site: RAL-LCG2
Incident Date: 2009-08-05
Severity: Field not defined yet
Service: CASTOR
Impacted: ATLAS simStrip
Incident Summary: Disk server gdss169, configured as RAID5 plus hot spare, lost two drives (ports 9 and 15) on 5th August 2009. Swift action saved the data on that occasion. The system then lost a further three drives (ports 9, 13 and 15) due to the air-conditioning failure in the machine room (HPD area). The drives in ports 9 and 15 failed on 12th August 2009, within a couple of minutes of each other, and the drive in port 13 failed on 17th August 2009 after the disk servers were powered back on.
Type of Impact: Data Loss
Incident duration:
Report date: 2009-08-18
Reported by: Kashif Hafeez, Tier1 Fabric Team
Related URLs:
Incident details:
Detailed timeline of events:
| Date | Time | Who/What | Entry | 
|---|---|---|---|
| 04/08/2009 | 17:28:20 | Nagios | Issued alarm:<br>Aug 04, 2009 05:28.20PM (0x04:0x0002): Degraded unit: unit=0, port=15<br>Aug 04, 2009 05:28.20PM (0x04:0x0009): Drive timeout detected: port=15 |
| 05/08/2009 | 08:44:39 | Kashif Hafeez | Created ticket (RT # 48759) and reported the failed drive in port 15 to Viglen (Wednesday 5th August 2009 at 08:35). |
| 06/08/2009 | 09:56:46 | Kashif Hafeez | Noticed that the system had another faulty drive, in port 9 (no log messages for drive 9). |
| 06/08/2009 | 09:56:46 | Shaun | System taken out of production in coordination with the Castor team. |
| 06/08/2009 | 10:37:35 | Kashif Hafeez | Replaced the drive in port 9 (borrowed from gdss87 port 15); rebuild started on port 9.<br>Aug 06, 2009 10:05.56AM (0x04:0x001A): Drive inserted: port=9<br>Aug 06, 2009 10:05.37AM (0x04:0x0019): Drive removed: port=9 |
| 06/08/2009 | 10:39:08 | Kashif Hafeez | Replaced the drive in port 15 with a new drive received from Viglen and added it as hot spare.<br>Aug 06, 2009 10:10.13AM (0x04:0x000B): Rebuild started: unit=0<br>Aug 06, 2009 10:09.57AM (0x04:0x001A): Drive inserted: port=15 |
| 07/08/2009 | 11:06:17 | James Thorne | Acknowledged that the rebuild had completed and asked the Castor team to put the system back into production.<br>Aug 07, 2009 12:05.37AM (0x04:0x0005): Rebuild completed: unit=0 |
| 07/08/2009 | 15:34:13 | Chris | Confirmed that the system was back in production. |
| 11/08/2009 | 23:44:38 | Syslogs | Issued soft alarm:  | 
| 12/08/2009 | 12:15:23 | Syslogs | Issued hard alarm:  |
| 12/08/2009 | 12:20:47 | Syslogs | Two drives failed, in ports 15 and 9.<br>Aug 12, 2009 12:23.05AM (0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204<br>Aug 12, 2009 12:22.35AM (0x04:0x005E): Cache synchronization completed: unit=0<br>Aug 12, 2009 12:22.35AM (0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204<br>Aug 12, 2009 12:20.47AM (0x04:0x0009): Drive timeout detected: port=15 |
| 12/08/2009 | 13:00:16 | Martin Bly/James Thorne | Powered off Tier1 disk servers and batch systems due to the air-conditioning failure in the machine room (HPD area). |
| 17/08/2009 | 10:30:53 | James Thorne | Powered on Tier1 disk servers. |
| 17/08/2009 | 10:38:53 | Syslogs | Another drive failure, in port 13.<br>Aug 17, 2009 10:38.54AM (0x04:0x0009): Drive timeout detected: port=13<br>Aug 17, 2009 10:38.54AM (0x04:0x0009): Drive timeout detected: port=13<br>Aug 17, 2009 10:38.54AM (0x04:0x0009): Drive timeout detected: port=13<br>Aug 17, 2009 10:38.53AM (0x04:0x0009): Drive timeout detected: port=13<br>Aug 17, 2009 10:38.53AM (0x04:0x000A): Drive error detected: unit=0,<br>Aug 17, 2009 10:38.53AM (0x04:0x0009): Drive timeout detected: port=13 |
| 17/08/2009 | 11:30:00 | James Thorne | Noticed that the system had failed drives in ports 9 and 13. |
| 17/08/2009 | 14:01:00 | John Kelly | Created RT # 49105. | 
| 17/08/2009 | 15:00:00 | Kashif Hafeez | Asked the Castor team to take the system out of production and to provide a spare disk server for copying the data. The Castor team nominated gdss273. |
| 17/08/2009 | 15:25:00 | James Thorne | Tried to copy data from gdss169 to gdss273. | 
| 17/08/2009 | 15:30:54 | James Thorne | Copy failed (array was inoperable). |
| 17/08/2009 | 15:35:21 | Kashif Hafeez | Replaced the drive in port 9 (borrowed from gdss87 port 14) and power-cycled the system, but this did not work. |
| 17/08/2009 | 16:01:00 | James Thorne/Kashif Hafeez | Informed the Castor and Production teams that the data was irrecoverable. |
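
The controller log excerpts quoted in the timeline follow a regular pattern (timestamp, a pair of hexadecimal codes, a message, optional `key=value` details), so they can be scanned automatically for the situation that caused the loss: more failing ports on one array than RAID5 can tolerate. The following is a minimal sketch only, not the monitoring used at the site. The regular expression is inferred from the excerpts above, and the names `FAILURE_MESSAGES`, `parse_event`, `failed_ports` and `array_at_risk`, as well as the choice of which messages count as failure indicators, are assumptions made for illustration.

```python
import re
from datetime import datetime

# Pattern inferred from the log excerpts in the timeline, e.g.
# "Aug 12, 2009 12:20.47AM (0x04:0x0009): Drive timeout detected: port=15"
EVENT_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} \d{2}, \d{4} \d{2}:\d{2}\.\d{2}[AP]M) "
    r"\((?P<sev>0x[0-9A-Fa-f]+):(?P<code>0x[0-9A-Fa-f]+)\): "
    r"(?P<msg>[^:]+)(?::\s*(?P<detail>.*))?$"
)

# Assumption: these message texts are treated as signs of a failing drive.
FAILURE_MESSAGES = {
    "Degraded unit",
    "Drive timeout detected",
    "Drive error detected",
    "Primary DCB read error occurred",
}


def parse_event(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    port = re.search(r"port=(\d+)", m["detail"] or "")
    return {
        "when": datetime.strptime(m["ts"], "%b %d, %Y %I:%M.%S%p"),
        "code": m["code"],
        "message": m["msg"].strip(),
        "port": int(port.group(1)) if port else None,
    }


def failed_ports(lines):
    """Return the set of ports that appear in any failure-type event."""
    ports = set()
    for line in lines:
        event = parse_event(line)
        if event and event["message"] in FAILURE_MESSAGES and event["port"] is not None:
            ports.add(event["port"])
    return ports


def array_at_risk(lines, tolerated=1):
    """True if more ports are failing than the array can tolerate (1 for RAID5)."""
    return len(failed_ports(lines)) > tolerated


# Three of the 12 August entries quoted in the timeline.
SAMPLE_LINES = [
    "Aug 12, 2009 12:23.05AM (0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204",
    "Aug 12, 2009 12:22.35AM (0x04:0x005E): Cache synchronization completed: unit=0",
    "Aug 12, 2009 12:20.47AM (0x04:0x0009): Drive timeout detected: port=15",
]

if __name__ == "__main__":
    print(sorted(failed_ports(SAMPLE_LINES)))  # [9, 15]
    print(array_at_risk(SAMPLE_LINES))         # True
```

Applied to the 12 August entries, the check reports ports 9 and 15 as failing and flags the array as at risk, which matches the two-drive failure recorded at that point in the timeline.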
Future mitigation:
Free text description of how site plans to minimise future occurrences
Related issues:
Anything else relevant
Timeline
| Event | Date | Time | Comment |
|---|---|---|---|
| Actually Started | 12/08/2009 | 12:20:47 | Two drives failed (ports 15 and 9) |
| Fault first detected | 12/08/2009 | 12:20:47 | Syslogs/Admin/User | 
| First Advisory Issued | How/To who | ||
| First Intervention | When you first tried to intervene | ||
| Fault Fixed | When was the problem resolved | ||
| Announced as Fixed | How, to who | ||
| Downtime(s) Logged in GOCDB | at risk/unscheduled down (what components/VOs) repeat as necessary | ||
| Other Advisories Issued | Where etc repeat as necessary | ||
