RAL Tier1 Operations Report for 19th June 2013
| Review of Issues during the week 12th to 19th June 2013. | 
-  There are ongoing intermittent problems starting LHCb batch jobs as well as a more general problem of low job start rates.
-  We have been timing out on some LHCb CE tests - some of which are (presumably erroneously) ending up in the 'whole node' queue and taking a long time to get scheduled.
-  There have been some more issues with Castor for CMS. Some high load and hot files (or hot disk servers) were seen. On Monday afternoon (17th) GDSS583 (CMSDisk) was put into a passive draining mode for a while, which redistributed some of the hot files.
| Resolved Disk Server Issues | 
-  GDSS720 (AtlasDataDisk - D1T0) crashed early in the morning of Thursday 13th June. A hardware fault (cpu/motherboard) has been identified. The system was returned to service later that afternoon and is being drained ahead of a hardware intervention. Two Atlas files were corrupted as they were being written to Castor at the time of the server failure.
| Current operational status and issues | 
-  Until the re-test of the UPS/Generator (scheduled for next Tuesday) we cannot be certain the generator backup would kick in.
-  The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
-  The problem of LHCb jobs failing due to long job set-up times remains and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
-  The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
-  We are participating in xrootd federated access tests for Atlas.
-  A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
| Ongoing Disk Server Issues | 
| Notable Changes made this last week | 
-  On Monday (17th) seven disk servers (the last of the 2012 batch) totalling around 630TB were added to AtlasDataDisk.
-  On Tuesday (18th) the maintenance work to replace a battery in controls for electrical switchgear in R89 was completed successfully.
| Advance warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
-  Following the successful swap of the battery in the electrical control system this week, a UPS/Generator load test is planned (subject to final confirmation) for Tuesday morning (25th June).
-  We are confirming whether there is any possible impact from scheduled maintenance on both CERN Primary & Backup links overnight Tue/Wed 25/26 June.
-  Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the coming weeks).
-  Scheduling of the Castor 2.1.13 upgrade is expected in the next few weeks.
Listing by category:
-  Databases:
-  Switch LFC/FTS/3D to new Database Infrastructure.
 
-  Castor:
-  Upgrade to version 2.1.13
 
-  Networking:
-  Single link to UKLight Router to be restored as paired (2*10Gbit) link.
-  Update core Tier1 network and change connection to site and OPN including:
-  Install new Routing layer for Tier1
-  Change the way the Tier1 connects to the RAL network. 
-  These changes will lead to the removal of the UKLight Router.
 
 
-  Grid Services
-  Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
-  Upgrade of one remaining EMI-1 component (UI) being planned.
 
-  Fabric
-  One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
 
-  Infrastructure:
-  A 2-day maintenance is being planned for sometime in October or November to cover the following. This is expected to require around a half day outage of power to the UPS room, with Castor & Batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
-  Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
-  Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
-  Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
 
 
| Entries in GOC DB starting between 12th and 19th June 2013. | 
There were no unscheduled entries in the GOC DB starting during the last week.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
| Whole Site | SCHEDULED | WARNING | 18/06/2013 10:00 | 18/06/2013 10:40 | 40 minutes | Site at risk during intervention (to replace a battery) in controls for electrical switchgear. | 
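The Duration column can be cross-checked against the Start and End timestamps. The following is a minimal sketch only: the `DD/MM/YYYY HH:MM` format string is inferred from the row above, and the function name `duration_minutes` is a hypothetical helper, not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from "18/06/2013 10:00" in the table above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def duration_minutes(start: str, end: str) -> int:
    """Return the whole number of minutes between two table timestamps."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Start and End values copied from the "Whole Site" row.
    print(duration_minutes("18/06/2013 10:00", "18/06/2013 10:40"))
```

For the single entry listed, this agrees with the 40 minutes recorded in the table.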
| Open GGUS Tickets (Snapshot at time of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
| 94959 | Green | Less Urgent | In Progress | 2013-06-19 | 2013-06-19 | CMS | Failed reprocessing jobs at T1_UK_RAL | 
| 94891 | Green | Urgent | In Progress | 2013-06-15 | 2013-06-15 | Atlas | RAL-LCG2_(HIMEM_)SL6: tasks are failing due to Athena Error | 
| 94755 | Red | Urgent | Waiting Reply | 2013-06-10 | 2013-06-12 |  | Error retrieving data from lcgwms04 | 
| 94731 | Red | Less Urgent | In Progress | 2013-06-07 | 2013-06-19 | cernatschool | WMS for cernatschool.org | 
| 94543 | Red | Less Urgent | In Progress | 2013-06-04 | 2013-06-11 | SNO+ | Job outputs not being retrieved | 
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-05-29 |  | LFC webdav support | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 |  | correlated packet-loss on perfsonar host | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
| 12/06/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure when deleting a file. | 
| 13/06/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 14/06/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 15/06/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 16/06/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 17/06/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 18/06/13 | 100 | 100 | 98.5 | 100 | 100 | Single SRM PUT test failure. Draining caused backlog of pending transfers. |
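As a rough summary of the daily figures above, a per-VO average for the week can be computed as follows. This is an illustrative sketch: it takes a simple unweighted mean over the seven listed days, which is an assumption and may not match how official availability figures are aggregated. The names `VOS`, `DAILY` and `weekly_mean` are hypothetical.

```python
from statistics import mean

VOS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

# Daily availability (%) copied from the table above, one row per day,
# in the same column order as VOS.
DAILY = {
    "12/06/13": [100, 100, 99.1, 100, 100],
    "13/06/13": [100, 100, 100, 100, 100],
    "14/06/13": [100, 100, 100, 100, 100],
    "15/06/13": [100, 100, 100, 100, 100],
    "16/06/13": [100, 100, 100, 100, 100],
    "17/06/13": [100, 100, 100, 100, 100],
    "18/06/13": [100, 100, 98.5, 100, 100],
}


def weekly_mean(daily):
    """Return {VO: unweighted mean availability over the listed days}."""
    columns = zip(*daily.values())  # transpose rows (days) into columns (VOs)
    return {vo: round(mean(col), 2) for vo, col in zip(VOS, columns)}


if __name__ == "__main__":
    for vo, value in weekly_mean(DAILY).items():
        print(f"{vo}: {value}")
```

Under that assumption, only the Atlas column falls below 100, at about 99.66 for the week.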