RAL Tier1 Operations Report for 22nd August 2012
| Review of Issues during the week 15th to 22nd August 2012 | 
-  On Thursday morning (16th) there was a problem on one of the switches in a stack and we lost access to a batch of worker nodes for a little over an hour. The fix (at 09:30) required the network switch stack to be reset, which broke connections to a batch of disk servers for between one and two minutes.
-  There have been a number of issues with SUM tests over the last week. On Sunday (19th) we failed CE tests for CMS owing to the test jobs being delayed by production work (resolved by altering the priority of the test jobs); overnight Tues/Wed (21/22) we failed CE tests for Atlas due to a backlog of Atlas test and software install jobs. There have also been some sporadic failures of tests on the SRMs that are under investigation. This was compounded by a problem with the central distribution of all SUM results on Thursday/Friday (16/17 Aug).
| Resolved Disk Server Issues | 
| Current operational status and issues | 
-  On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. In particular, one half of the new switchboard has been refurbished and is on track to be brought into service by 17 September. Once this is operational, RAL will be switched over to it and will no longer be dependent on the old switchgear.
| Ongoing Disk Server Issues | 
| Notable Changes made this last week | 
-  Site firewall re-configuration took place successfully on the morning of Tuesday 21st August. (We had declared a "Warning" (At Risk) for the Tier1 site, and drained and stopped FTS during the intervention period.)
-  As part of the continuing test of hyperthreading, one batch of worker nodes (the Dell 2011 batch) has had its number of jobs increased further (from 20 to 22).
-  As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
-  A test queue ("gridTest") is available with (currently) four worker nodes running EMI2/SL5.
| Advanced warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
Listing by category:
-  Databases:
-  Switch LFC/FTS/3D to new Database Infrastructure.
 
-  Castor:
-  Migration of LHCb data from "A" to "C" tape media.
-  Upgrade to version 2.1.12.
 
-  Networking:
-  Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKlight network).
-  Update Spine layer for Tier1 network.
-  Replacement of UKLight Router.
-  Addition of caching DNSs into the Tier1 network.
 
-  Grid Services:
-  Updates of Grid Services as appropriate. (Services now on EMI/UMD versions unless there is a specific reason not to be.)
 
-  Infrastructure:
-  Intervention required on the "Essential Power Board". (Following further checks this should not require a power outage in UPS room, but should be an "At Risk").
-  Remedial work on three (out of four) transformers. Will require two "At Risk" periods.
-  Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".
 
| Entries in GOC DB starting between 15th and 22nd August 2012 | 
There were no Unscheduled entries in the GOC DB for this period.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
| Whole Site | SCHEDULED | WARNING | 21/08/2012 08:00 | 21/08/2012 10:00 | 2 hours | Configuration changes on site firewall. Anticipate two short breaks (each of a few minutes) in this time window. Will stop the FTS (subject of separate GOC DB entry). | 
| FTS | SCHEDULED | OUTAGE | 21/08/2012 07:00 | 21/08/2012 10:00 | 3 hours | Drain and stop of FTS during firewall reconfiguration. | 
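The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch, assuming the timestamps use the DD/MM/YYYY HH:MM layout shown in the table; the helper name `duration_hours` and the `ENTRIES` list are illustrative and not part of any GOC DB tooling.

```python
from datetime import datetime

# Start/end strings copied from the GOC DB table above (DD/MM/YYYY HH:MM).
ENTRIES = [
    ("Whole Site", "21/08/2012 08:00", "21/08/2012 10:00"),
    ("FTS", "21/08/2012 07:00", "21/08/2012 10:00"),
]

# Assumed timestamp layout, matching how the table prints dates.
FMT = "%d/%m/%Y %H:%M"


def duration_hours(start, end):
    """Return the length of a downtime window in hours (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


for service, start, end in ENTRIES:
    print(f"{service}: {duration_hours(start, end):g} hours")
```

Run against the two entries above, this reproduces the listed durations of 2 hours and 3 hours.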
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
| 85023 | Red | Less Urgent | Waiting Reply | 2012-08-09 | 2012-08-10 | SNO+ | WMS | 
| 84492 | Red | Urgent | In Progress | 2012-07-24 | 2012-08-17 | SNO+ | Job time/memory requirements not provided | 
| 84408 | Red | Very Urgent | In Progress | 2012-07-20 | 2012-08-20 | neurogrid | Enable neurogrid.incf.org on WMS and LFC | 
| 68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-07-30 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers |