RAL Tier1 Operations Report for 8th May 2013
| Review of Issues during the week 1st to 8th May 2013. | 
-  On Thursday (2nd May) there was a problem accessing files on an Atlas disk server (GDSS559). Investigations took place during the day. The 'stuck' files were manually copied off to resolve the immediate problem. The server was finally rebooted.
| Resolved Disk Server Issues | 
| Current operational status and issues | 
-  The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
-  The problem of LHCb and Atlas jobs failing due to long job set-up times remains and investigations continue.
-  The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
-  We are participating in xrootd federated access tests for Atlas.
-  A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
-  There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
| Ongoing Disk Server Issues | 
| Notable Changes made this last week | 
-  Yesterday morning (7th May) glibc was updated on the current Castor standby database nodes to bring them to the same level as the current production nodes.
-  This morning (Wed 8th May): Castor primary and standby databases switched.
-  The final disk controller firmware updates for the 2011 Clustervision batch of disk servers were done (for the Alice servers) on Thursday (2nd May).
| Advance warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
-  Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the next few weeks).
-  Tuesday 21st May - Planned networking intervention at RAL.
-  The blocking issue regarding the Castor 2.1.13 upgrade has been resolved and the scheduling of this upgrade will proceed.
Listing by category:
-  Databases:
-  Switch LFC/FTS/3D to new Database Infrastructure.
 
-  Castor:
-  Upgrade to version 2.1.13
 
-  Networking:
-  Single link to UKLight Router to be restored as paired (2*10Gbit) link.
-  Update core Tier1 network and change connection to site and OPN including:
-  Install new Routing layer for Tier1
-  Change the way the Tier1 connects to the RAL network. 
-  These changes will lead to the removal of the UKLight Router.
 
-  Addition of caching DNSs into the Tier1 network.
 
-  Grid Services:
-  Testing of alternative batch systems (SLURM, Condor).
-  Upgrade of one remaining EMI-1 component (UI) being planned.
 
-  Fabric:
-  One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
 
-  Infrastructure:
-  Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
-  Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
-  Electrical safety check. This will require significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
 
| Entries in GOC DB starting between 1st and 8th May 2013. | 
There were no unscheduled outages during the last week.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
| All Castor (SRM end points) and batch (all CEs) | SCHEDULED | OUTAGE | 08/05/2013 10:00 | 08/05/2013 13:00 | 3 hours | Stop of Castor storage system while primary and standby databases are switched over. During the stop no batch jobs will be started. Batch work already running may be paused (depending on the VO). | 
| All Castor (SRM end points) | SCHEDULED | WARNING | 01/05/2013 08:00 | 01/05/2013 12:00 | 4 hours | At Risk on storage (Castor) services during application of Oracle patches to back end database systems. | 
| Open GGUS Tickets (Snapshot at time of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
| 93149 | Green | Less Urgent | In Progress | 2013-05-06 | 2013-05-07 | CMS | T1_UK_RAL squid upgrade | 
| 93149 | Red | Less Urgent | On Hold | 2013-04-05 | 2013-04-08 | Atlas | RAL-LCG2: jobs failing with "cmtside command was timed out" | 
| 92266 | Red | Less Urgent | Waiting for Reply | 2013-03-06 | 2013-04-16 |  | Certificate for RAL myproxy server | 
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 |  | LFC webdav support | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 |  | correlated packet-loss on perfsonar host | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
| 01/05/13 | 100 | 100 | 97.2 | 100 | 100 | "could not open connection to srm-atlas.gridpp.rl.ac.uk" while main Castor database was being patched. | 
| 02/05/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 03/05/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 04/05/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 05/05/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 06/05/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 07/05/13 | 100 | 100 | 100 | 100 | 100 |  |