Tier1 Operations Report 2009-12-16
RAL Tier1 Operations Report for 16th December 2009.
This is a review of issues since the last meeting on 9th December.
Review of Issues during week 9th to 16th December.
- A tape recall problem affecting the Castor CMS and GEN instances started over the weekend and was resolved at around 9am on Tuesday morning. It caused significant problems for CMS transfers from RAL.
- There was a problem on lcgce02 (used by non-LHC VOs) over the weekend. The system had locked up and had to be restarted on Monday morning (14th).
- A post mortem for the double disk failure on gdss138, part of the LHCb_Dst space token (D1T0) with resulting data loss, was posted at: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091130
- Disk server gdss354 (part of Atlas MCDISK) was unavailable for just over an hour on 15th December for a reboot.
Current operational status and issues.
- Long-standing database disk array problem. (No update since last week's report.)
- There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.
- A mismatch between tape contents and Castor meta-data is being investigated. The problem dates from 2007 and has been found for CMS data; it affects 11 tapes holding a total of 983 files. So far, investigations have found no other evidence of this problem.
- Configuration issue on CREAM CE (lcgce01) caused problems for Monte-Carlo production jobs for CMS. Awaiting application of the fix.
- Ongoing problem on the OPN (UKlight) link to Lancaster. It went down around 17:30 on Wednesday (9th), with a report of broken/damaged fibre, and has been up and down a couple of times since. Currently (13:00, Wednesday 16th) the link is down. The networking team is aware.
Advance warning:
- Thursday 17th December: Outage on SL4 Alice VO box (lcgvo0597) ahead of withdrawal from service.
- Thursday 17th December: At Risk for Castor during migration of LSF triplet (LSF license servers). (TBC)
- During the first part of the week beginning Monday 21st December:
  - Turn off the old home NFS file system. (Already replaced; just a tidy-up.) (TBC)
  - Monday 21st December: Migrate CIP (Castor Information Provider) to more resilient hardware. (TBC)
  - Tuesday 22nd December: Reboot of disk servers to pick up new kernels. (Castor stop, Batch Pause.)
- Updates being planned for January:
  - Tuesday 5th January: Test of UPS bypass. Databases will be stopped, so this will include a stop of services (Castor, LFC, FTS).
  - Establishing further plans for January.

Table showing entries in GOC DB starting between 9th and 16th December.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| site-bdii | SCHEDULED | AT_RISK | 10/12/2009 09:00 | 10/12/2009 10:00 | 1 hour | At Risk during change to improve resilience of way data from Castor is published. | 
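
As a cross-check on the Duration column, the figure can be recomputed from the Start and End cells. The following is a minimal sketch, assuming the timestamps are always in the day/month/year, 24-hour layout shown in the table; the names `GOCDB_FORMAT` and `downtime_duration` are hypothetical and are not part of the GOC DB.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching the Start/End cells in the table above
# (day/month/year, 24-hour clock). Hypothetical name, not a GOC DB constant.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the length of a downtime from its Start and End cells."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    if finish < begin:
        raise ValueError("End precedes Start")
    return finish - begin


# The site-bdii At Risk entry from the table.
duration = downtime_duration("10/12/2009 09:00", "10/12/2009 10:00")
print(duration)  # 1:00:00
```

For the site-bdii entry this gives one hour, which agrees with the Duration column.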
