Tier1 Operations Report 2009-10-14
RAL Tier1 Operations Report for 14th October 2009.
This is a review of issues since the last meeting on 7th October.
Current operational status and issues.
- Major problems with the hardware underneath both the Castor Oracle databases and those used to support the LFC, FTS and 3D databases. All these services were unavailable for at least part of the week. The problems were caused by multiple failures in the disk systems that host the Oracle databases behind these services. We are currently running with the databases hosted on alternative hardware while the fault is investigated. Our current conclusion is that the fault is environmental, and the work so far points to an electrical cause.
- LFC and FTS were unavailable from lunchtime Tuesday (6th) to late afternoon Wednesday (7th).
- Castor was unavailable from Sunday 4th to the end of Friday afternoon (9th).
- 3D databases (including lhcb-lfc) were unavailable from lunchtime Tuesday (6th) to early afternoon Monday (12th).
 
- The restore of the Castor databases has introduced a problem. It appears that the Castor databases were restored to a point early on the 24th September, and all files added to Castor between that date and the failure on the 4th October may be lost.
- The patched version of the SRM (2.8-1) has been installed for srm-atlas. Installation on the other SRMs is still awaited (delayed by the database hardware problems).
- Swine ‘Flu. As previously reported: We continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite.
Review of Issues during week 7th to 14th October.
- The main point here was the outages referred to in the 'Current operational status' section above.
- There was a problem late yesterday afternoon (13th October) and last night. At around 16:00 the Tier1 started failing SRM SAM tests. Investigations showed high CPU usage by some processes in the Oracle RAC. This was traced to a high load causing an out-of-memory condition, which stopped Oracle and then resulted in a node reboot. This was followed during the evening and night by problems on the CEs and batch system as well as srm-atlas. A DNS problem had occurred and evidence suggests this was the underlying cause.
Advance warning:
There are no scheduled outages declared in the GOC DB. However, we will need to reschedule the installation of the updated SRM for the CMS, LHCb and GEN Castor instances.
Table showing entries in GOC DB starting between 7th and 14th October.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| srm-atlas, lcgce06 | UNSCHEDULED | OUTAGE | 13/10/2009 19:47 | 13/10/2009 23:05 | 3 hours and 18 minutes | Castor Atlas down due to local networking problems. | 
| lcgce07 | UNSCHEDULED | AT_RISK | 13/10/2009 10:00 | 13/10/2009 12:00 | 2 hours | At risk to swap a broken hard disk | 
| lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 09/10/2009 18:00 | 12/10/2009 14:45 | 2 days, 20 hours and 45 minutes | Extending downtime as the work to restore the services is still ongoing | 
| All Castor & CEs | UNSCHEDULED | OUTAGE | 09/10/2009 12:00 | 09/10/2009 17:00 | 5 hours | Work is progressing restoring the Castor databases with the plan of restarting services tomorrow. The completion of this process, including a verification of Castor systems, is now estimated to be completed by the end of the afternoon. | 
| lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 08/10/2009 16:00 | 09/10/2009 18:00 | 1 day, 2 hours | Extending downtime as the work to restore the services is still ongoing | 
| All Castor & CEs | UNSCHEDULED | OUTAGE | 08/10/2009 12:00 | 09/10/2009 12:00 | 24 hours | Following hardware issues with the systems that host the Oracle databases behind Castor we are migrating the data to alternative hardware. Some issues have been encountered with restoring the databases ahead of the migration. We are therefore extending the Castor downtime. | 
| lhcb-lfc | UNSCHEDULED | OUTAGE | 07/10/2009 17:30 | 08/10/2009 17:00 | 23 hours and 30 minutes | The LHCb 3D database (including lhcb-lfc) is being moved to an alternative system while investigations are ongoing into the hardware failures that have been encountered. | 
| All Castor & CEs | UNSCHEDULED | OUTAGE | 07/10/2009 14:00 | 08/10/2009 12:00 | 22 hours | Following the failure of the disk systems that host the Oracle databases behind Castor we are having to restore the databases from backup. We will run on alternative hardware temporarily while the current hardware problems are understood and the system re-certified. | 
| lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc | UNSCHEDULED | OUTAGE | 07/10/2009 14:00 | 07/10/2009 16:00 | 2 hours | The restoration of the databases behind the LFC and FTS to alternative hardware is taking place but a little behind the schedule previously announced. This is a 2 hour extension to the outages for these services. | 
| lugh, ogma | UNSCHEDULED | OUTAGE | 07/10/2009 14:00 | 08/10/2009 17:00 | 1 day, 3 hours | The hardware that hosts the 3D databases has become too unstable. These databases are being migrated to alternative hardware before resuming the service. | 
| lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 06/10/2009 18:00 | 07/10/2009 16:00 | 22 hours | Following the failure of the disk systems that host the Oracle databases behind these services we are having to restore some of the databases from backup and will migrate to alternative hardware while the underlying problems are resolved. | 
| lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc, lugh, ogma | UNSCHEDULED | OUTAGE | 06/10/2009 14:03 | 06/10/2009 18:00 | 3 hours and 57 minutes | Outage while the problem on the Oracle systems behind the LFC and FTS systems is being investigated. | 
| All Castor & CEs | UNSCHEDULED | OUTAGE | 06/10/2009 14:00 | 07/10/2009 14:00 | 24 hours | Outage to investigate the ongoing problems with the hardware behind the Castor Oracle database. | 
| ftm, lcgfts, lfc-atlas, lfc, lhcb-lfc | UNSCHEDULED | OUTAGE | 06/10/2009 12:25 | 06/10/2009 14:30 | 2 hours and 5 minutes | We have just had a problem on the Oracle systems behind the LFC and FTS systems. Being looked at. | 
| lcgce06, lcgce08, srm-atlas, srm-lhcb | SCHEDULED | OUTAGE | 06/10/2009 10:00 | 06/10/2009 12:00 | 2 hours | Outage for reconfiguration of the Oracle RAC behind the Atlas and LHCb Castor instances. This is to remove a faulty node within the RAC. | 
| lcgfts.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 06/10/2009 09:00 | 06/10/2009 12:00 | 3 hours | At Risk while channels for RAL for Atlas and LHCb are drained ahead of the Castor intervention. Other channels will be unaffected. | 
| All Castor & CEs | UNSCHEDULED | OUTAGE | 05/10/2009 14:00 | 06/10/2009 14:00 | 24 hours | Ongoing problems with the hardware that underlies the Oracle databases behind Castor are being investigated. The cause for multiple failures is not yet understood and we are announcing an extended downtime as these investigations continue. | 
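
Because the Castor outage was declared as a chain of back-to-back extensions, the per-row durations above do not directly show the length of the continuous downtime. The following is a minimal sketch of how the rows could be merged and totalled. It is not a GOC DB tool: the helper names `parse` and `merge_windows` are hypothetical, and the timestamps are copied by hand from the "All Castor & CEs" rows of the table. It covers only entries starting in the reported window, so it does not include downtime declared before 5th October.

```python
from datetime import datetime, timedelta

FMT = "%d/%m/%Y %H:%M"  # day/month/year layout used in the table


def parse(ts):
    """Parse a table timestamp such as '09/10/2009 17:00' (hypothetical helper)."""
    return datetime.strptime(ts, FMT)


def merge_windows(windows):
    """Merge overlapping or back-to-back (start, end) windows (hypothetical helper)."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Extends or touches the previous window: grow it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# "All Castor & CEs" rows, copied from the table above.
castor_rows = [
    ("05/10/2009 14:00", "06/10/2009 14:00"),
    ("06/10/2009 14:00", "07/10/2009 14:00"),
    ("07/10/2009 14:00", "08/10/2009 12:00"),
    ("08/10/2009 12:00", "09/10/2009 12:00"),
    ("09/10/2009 12:00", "09/10/2009 17:00"),
]

windows = [(parse(s), parse(e)) for s, e in castor_rows]
merged = merge_windows(windows)
total = sum((end - start for start, end in merged), timedelta())

print(len(merged))                   # number of distinct outage windows
print(total.total_seconds() / 3600)  # declared Castor downtime in hours
```

For these five rows the windows collapse into a single outage from 05/10/2009 14:00 to 09/10/2009 17:00, which is 99 hours of declared downtime.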
