RAL Tier1 Incident 20100212 Tape problems led to data loss
Two separate problems with tapes led to file (data) loss
Site: RAL-LCG2
Incident Date: 2010-02-12
Severity: Severe
Service: CASTOR Tape
Impacted: CMS
Incident Summary: Two separate problems were discovered on two tapes. While the issues were unconnected, both led to file (data) loss.
Type of Impact: Data Loss
Incident duration: N/A
Report date: 2010-02-15
Reported by: Gareth Smith, Tim Folkes
Related URLs: None
Incident details:
As a result of routine tape monitoring during a 'repack' operation, problems were found on two tapes that contained CMS data. The tapes were written at widely separated times and the root causes of the two faults are not connected.
First Tape. Tape CS1472 was found to give hard errors on both reads and writes. Tests showed that the tape media is defective. 102 files were lost.
Second Tape. Tape CS3410 was found to have problems reading files. Investigation showed that although Castor claims to have written 327 files, all the way to the end of the tape, the double tape mark indicating the end of the data was after file position 222. It is not possible to tell whether any data was written after this point. The first 222 files on the tape are available; the remaining 105 files were lost.
Approximately 50 of the files were recovered from disk. However, as both tapes are in a Castor D0T1 service class, it was not expected that many disk copies would be available locally at RAL.
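The file counts above can be cross-checked with a short calculation. The following is a minimal sketch in Python; the function and variable names are illustrative only and are not part of Castor.

```python
def files_beyond_end_of_data(files_claimed: int, last_readable_position: int) -> int:
    """Count files the catalogue claims exist beyond the end-of-data tape mark."""
    if not 0 <= last_readable_position <= files_claimed:
        raise ValueError("tape mark position must lie within the claimed file count")
    return files_claimed - last_readable_position


# Figures taken from this report.
cs1472_lost = 102                                   # media fault
cs3410_lost = files_beyond_end_of_data(327, 222)    # software fault
total_lost = cs1472_lost + cs3410_lost

print(cs3410_lost)  # 105
print(total_lost)   # 207
```

With approximately 50 files recovered from disk, roughly 157 of the 207 files remain unavailable at RAL.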
Future mitigation:
Further recovery efforts could be considered, such as sending tape CS1472 to a professional data recovery company. However, the fault on CS3410 was a software error rather than a media error, and we do not know whether such a company would be able to read past a double tape mark.
A review of proactive procedures should be carried out to assess whether anything (such as 'scrubbing' or continually reading tapes, as done elsewhere) should be undertaken. This in turn requires monitoring of our tape failure rates to compare with industry averages.
Note: Added May 2013 when closing this incident: Tape scrubbing or 'verification' (as referred to in Castor) has been enabled and is now in regular use to validate tapes.
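The idea behind scrubbing can be illustrated as follows: each file is read back and its checksum is compared with the value recorded when it was written. This is only a sketch under stated assumptions. It assumes an Adler-32 checksum is recorded per file, and the `catalogue` mapping and `read_file` callback are hypothetical stand-ins, not Castor's actual interface.

```python
import zlib
from typing import Callable, Dict, List


def verify_tape(catalogue: Dict[str, int],
                read_file: Callable[[str], bytes]) -> List[str]:
    """Return the names of files that cannot be read or fail their checksum.

    catalogue maps file name -> expected Adler-32 checksum (hypothetical layout).
    read_file reads one file back from tape and raises OSError on a read error.
    """
    failed = []
    for name, expected in catalogue.items():
        try:
            data = read_file(name)
        except OSError:
            failed.append(name)      # hard read error, e.g. defective media
            continue
        if zlib.adler32(data) & 0xFFFFFFFF != expected:
            failed.append(name)      # readable, but contents do not match
    return failed


# Example with an in-memory "tape": the second file was never written.
tape = {"file_1": b"cms event data"}
catalogue = {
    "file_1": zlib.adler32(b"cms event data") & 0xFFFFFFFF,
    "file_2": zlib.adler32(b"more event data") & 0xFFFFFFFF,
}


def read_from_tape(name: str) -> bytes:
    if name not in tape:
        raise OSError(f"cannot read {name}")
    return tape[name]


print(verify_tape(catalogue, read_from_tape))  # ['file_2']
```

Running such a check regularly would reveal both kinds of fault seen in this incident (unreadable media and files missing beyond the end-of-data mark) before a user or a repack encounters them.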
Related issues:
None
Timeline
| Event | Date | Comment |
|---|---|---|
| CS1472 tape written | 2010-01-07 | CMS data migrated to tape no: |
| CS3410 tape written | 2009-10-17 | CMS data migrated to tape no: |
| Problem first discovered on tape CS1472 | 2010-02-11 | During repack |
| Problem first discovered on tape CS3410 | 2010-02-10 (or thereabouts) | During user read |
| CMS notified of issue | approx. 2010-02-11 | |
| CMS provided with list of files lost | 2010-02-12 | |
