RAL Tier1 Incident 20091009 long
Current Situation: Incident Ongoing. Report not complete.
As this starts from another incident there may - for a while - be some odd text in here...
Error in restore of Castor Database led to Data Loss.
Site: RAL-LCG2
Incident Date: 2009-10-09
Severity: 4
Service: Storage (Castor)
Impacted: All local VOs
Incident Summary: Following a of both disk systems used to host the Castor databases it was necessary to carry out a database restore. The restore completed but the database was subsequently found to have been restored to an earlier date. This resulted in data loss from Castor.
Type of Impact: Data Loss
Incident duration: N/A
Report date: 2009-10-20
Reported by: Gareth Smith, Gordon Brown, Keir Hawker
Status of this Report: In preparation
Related URLs: The failures in the hardware underneath the Castor databases are dealt with in a separate incident report at: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004
Description of the Castor database Systems.:
There are two Castor databases:
- PLUTO - which contains the Castor nameserver database (for all instances) and the stager databases for the CMS and 'GEN' instances.
- NEPTUNE - which contains the Castor stager databases for the Atlas and LHCb instances.
Since May 2009, the CASTOR Oracle databases have been using a pair of high end storage arrays. These have been configured as two disk groups (the actual Oracle terminology is FAIL GROUP) with the mirroring being performed by the Oracle Automatic Storage Manager (ASM) layer. These do not act as a classical master/mirror setup, they are just two disk groups with the same data being written to each. The database can be run on either disk group.
Apart from the redundancy, this setup was chosen because of the ability to add other disk groups without having to offline the database. This would be essential to migrate to new disks or add more capacity to the existing hardware.
Incident Overview:
Before the failure on 4th October that led to loss of service there had been problems on one of the disk groups. This had led to a previous outage of several hours on 10th September. Following that incident a patch was received from Oracle and applied (on 21st September) such that the databases would not hang when there was a failure of one of the disk groups in the same manner. From this point the databases were running on both disk groups. Following another failure of the same disk group on 24th September the system was running on only one of the disk groups.
late morning on Sunday 4th October the other disk group failed. On Monday 5th October work was undertaken to rapidly return Castor to service. An initial database restore was done for the Neptune database to a point just ahead of errors seen on the storage system at Saturday 3rd October 15:20. However, the ongoing underlying hardware problems led to the database running chaotically and the decision was taken to prepare alternative hardware ready to receive the database.
The disaster review team concluded that there was no need to restore to such an early point, and the restore was repeated to a later point (4th October 04:00).
There were problems in restoring the PLUTO database. A decision was made to restore both PLUTO and NEPTUNE to the last point at which there was consistency between them, this being Sat 3-Oct 20:00. This was done on 6th October. After space issues with a node on the PLUTO database, the database came up during the afternoon - albeit on the undetected “old” disk group which last had data written to it on 24th September. A full database backup was then taken.
Following a further incident review it was decided that data loss could be minimised by restoring PLUTO to the last point possible (moment before the crash???), and NEPTUNE to the last available point 04:00 on Saturday 4-Oct. The implications of restoring the two databases to different points having been discussed with the Castor developers. PLUTO was restored to backup from 6th October. However, unknown at the time was that this only contained data to 24th September.
On Friday 9th October the database restores were complete, and after some hours checking Castor was enabled. The checking focused on looking for problems that might arise owing to the difference in the restore points between NEPTUNE and PLUTO. Castor services were re-enabled by 17:00 on that day. Castor appeared to work well over the weekend.
During the day on Monday 12th October there were initial reports from CMS of files missing from castor. These were not understood. It became clear at the end of Tuesday 13th that there was something wrong with the PLUTO database and the restore was not as expected. The realisation that the FileIDs in Castor had jumped back indicated the problem. (The FileID is a unique number generated by from Oracle and it is normally increasing.)
Subsequent discussions with Castor developers (at the Castor face to Face meeting at CERN that week) suggested a problem with this re-use of FileIDs. Notably there was a theoretical way in which duplicate IDs might lead to data being deleted subsequently. This required an atypical manner of working. Nevertheless, while believing the risk small, castor was stopped briefly between 11:05 and 11:15 on Thursday 15th October for the FileID number to be jumped up beyond the range where it could be duplicated.
On Friday 16th October the PLUTO database was restored to a point the day before the final crash (Sat, 3 Oct 2009 04:21:47 UTC). This enabled the experiments to be supplied with a list of files written to Castor during the window of the database rewind, and hence lost to Castor. This database restore was not quite to the point of the crash, the list of lost files in the last day being extracted from the Castor log files.
Future mitigation:
| Issue | Response | 
|---|---|
| Data Integrity | The main goal is to have confidence in the database always running on the most up to date disk group. Database Services are working with Oracle to identify best practice in syncing “down” disk groups – and making sure the database will run on the “latest” disk group when being restarted. A way needs to be identified to make sure that the database opens on the correct disk group automatically and every time. | 
| database Architecture | A complete review of the current database ASM setup is taking place. Database Services are working with Oracle and CERN to further investigate the current storage policy and identify any appropriate alternatives (such as mirroring at the hardware level). Availability, redundancy and recoverability will obviously be taken in to account. | 
| Database Backup and Recovery Procedures | The RAL database Services team continues to amend its policies and was already in the process of analysing and practicing over 20 different recovery scenarios. The team will continue to gain knowledge and experience in these complicated areas and work with other institutions to refine and discuss its policies and procedures. It is important database restores can be repeated to a different end time. Any restore must not jeopardise future restores. When recovering a database, the team have noted it is vital to switch off all file deletion (i.e. through “DELETE OBSOLETE”) until the database has been recovered and verified as correct. | 
| Database Log Checking & Time Constraints | More thorough checking of the restore logs – by at least two database administrators. It is important that time is taken during the restore and recovery procedures and actions are checked carefully, fully logged and the timeline written down as it happens. Procedures will be updated to clarify exactly what to check for in the backup and recovery logs. There is a pressure on the team to restore service availability but attention to detail should not be compromised as the consequences are worse. | 
| Complexity of database restore requests. | There were attempts to restore the databases to a number of different times during the recovery. Whilst each of these was made at the time with the aim of minimising data loss, it is clear that this conflicting requests increased the complexity of the restore tasks. This was coupled with the restores to the initial disk arrays suffering from the hardware continuing to be unreliable. At the same time pressures to restore services as soon as possible were evident. Procedures should be amended that any database restore point is agreed in advance ahead of time. | 
| Faulty Database not detected before Castor re-started. | The Castor team put a lot of effort into checking the system befor re-enabling services on the 9th October. Nevertheless the problem remained undetected. Generic tests to verify the data in Castor need to be devised and used. (For example verifying recently written files against the LFC). Linked to this a start-up procedure that includes time in a read-only mode, or with synchronisation disabled should be investigated. | 
Related issues: None.
Timeline
| Date | Time | Comment | |
|---|---|---|---|
| Actually Started | 2009-10-09 | 17:00 | Restart of Castor | 
| First Realisation of a problem | 2009-10-12 | First report of missing files. | |
| First announcement of Problem | 2009-10-13 | 15:15 | Announcement of problem with missing files being investigated. | 
| Problem resolved | |||
| Announced as Fixed | 
Incident details Timeline:
| Date | Time | Who/What | Entry | 
|---|---|---|---|
| 2009-10-05 | 14:00 | Castor/Database Team | Initial restore of NEPTUNE database to 15:20 on 3rd October. | 
| 2009-10-05 | 16:00 | Tier1 Disaster Management Team | Invoke Disaster Management Process | 
| 2009-10-05 | 17:50 | Database Team | Start restore of Neptune database to 04:00 on 4th October. (Change from initial restore following disaster management meeting). | 
| 2009-10-06 | 08:00 | Database Team | Start to Restore Neptune/Pluto to Last Consistent Moment Between Them (Sat 3-Oct 20:00). | 
| 2009-10-06 | afternoon | Database Team | After space issues with a node on the PLUTO database, the database comes up on the afternoon - albeit on the undetected “old” disk group which last had data written to it on 24th September. Full database backup taken. | 
| 2009-10-07 | Database Team | Attempts are still being made to restore PLUTO to 3rd October but this is hampered by controlfile issues after the new incarnation had been created (since understood). | |
| 2009-10-08 | 09:00 | Database Team | Restore Neptune to Last Possible Point 04-Oct 04:00 | 
| 2009-10-08 | Database Team | Restore PLUTO to Last Possible Point due to failure to restore to 3rd October. Restored to backup from 6th October – which only contained data to 24th September. | |
| 2009-10-09 | 17:00 | Castor team | Restart of Castor Services | 
| 2009-10-13 | 17:45 | Matt Viljoen | Report have found evidence of FileIDs within Castor being re-used. | 
