Latest revision as of 11:52, 12 September 2012
RAL Tier1 Operations Report for 12th September 2012
| Review of Issues during the week 29th August and 5th September 2012 | 
-  Problem overnight Wed/Thu (5/6 Sep). One of the pair of uplinks to a switch stack was failing intermittently, which caused problems on at least one of the switches in the stack. Rather than merely degrading access to the connected systems, this caused access to some of them to fail completely for periods, leading to failures accessing one batch of disk servers and some worker nodes.
-  An update to the LHCb Castor stager to version 2.1.12 was announced for Tuesday morning (11th September) but was cancelled when a problem was found in testing the day before.
| Resolved Disk Server Issues | 
| Current operational status and issues | 
-  On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half. The work is running to schedule. In particular, one half of the new switchboard has been refurbished and is on track to be brought into service by 17 September. Once this is operational, RAL will be switched over to it and will no longer depend on the old switchgear.
-  The migration of LHCb data from the T10KA to the T10KC tapes is progressing.
| Ongoing Disk Server Issues | 
| Notable Changes made this last week | 
-  The rolling migration of (non-LHC) LFC front ends to EMI-2 on Virtual Machines is underway.
-  Continuing test of hyperthreading on one batch of worker nodes (the Dell 2011 batch). Problems have been seen when there are many CPU-bound jobs (Atlas Monte Carlo) on the same node. Such jobs have taken longer to run on these nodes and exceeded the maximum wall time. In response, the overcommit of jobs was reduced on Tuesday (11th Sep): the total job slots were reduced from 24 to 18 on these 12-core nodes.
-  As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
-  A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
-  A test instance of FTS version 3 is now available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
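
The job-slot reduction on the hyperthreaded Dell 2011 nodes noted above can be expressed as an overcommit ratio. The following is a minimal sketch, assuming that "12-core" refers to physical cores; the function name `overcommit_ratio` and the constant names are illustrative and not taken from the batch system configuration.

```python
def overcommit_ratio(job_slots: int, physical_cores: int) -> float:
    """Return the number of job slots offered per physical core."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return job_slots / physical_cores


# Figures from the report; treating 12 as the physical core count is an assumption.
PHYSICAL_CORES = 12
SLOTS_BEFORE = 24
SLOTS_AFTER = 18

ratio_before = overcommit_ratio(SLOTS_BEFORE, PHYSICAL_CORES)
ratio_after = overcommit_ratio(SLOTS_AFTER, PHYSICAL_CORES)

# Fraction of the job slots removed from each node by the change.
slots_removed_fraction = (SLOTS_BEFORE - SLOTS_AFTER) / SLOTS_BEFORE

print(f"slots per core before: {ratio_before:.2f}")
print(f"slots per core after:  {ratio_after:.2f}")
print(f"fraction of slots removed: {slots_removed_fraction:.0%}")
```

Under that assumption, the change moves each node from two job slots per core to one and a half, removing a quarter of its slots.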
| Advance warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
Listing by category:
-  Databases:
-  Switch LFC/FTS/3D to new Database Infrastructure.
 
-  Castor:
-  Upgrade to version 2.1.12. Expected to be ready imminently.
 
-  Networking:
-  Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
-  Update Spine layer for Tier1 network.
-  Replacement of UKLight Router.
-  Addition of caching DNSs into the Tier1 network.
 
-  Grid Services:
-  Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
 
-  Infrastructure:
-  Intervention required on the "Essential Power Board". (Should be an "At Risk"). Likely to be in November.
-  Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
-  Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".
 
| Entries in GOC DB starting between 29th August and 5th September 2012 | 
There were no Scheduled or Unscheduled entries in the GOC DB for this period. (Note: We did declare a downtime to upgrade the LHCb Castor Stager on Tuesday morning (11th) but that was cancelled.)
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
| 85889 | Yellow | Less Urgent | In Progress | 2012-09-06 | 2012-09-11 | OPS | ops pilot role not enabled on lcgwms03.gridpp.rl.ac.uk | 
| 85077 | Red | Less Urgent | In Progress | 2012-08-13 | 2012-09-03 | biomed | CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk | 
| 85023 | Red | In Progress | Waiting Reply | 2012-08-09 | 2012-09-11 | SNO+ | WMS | 
| 84492 | Red | Urgent | In Progress | 2012-07-24 | 2012-08-31 | SNO+ | Job time/memory requirements not provided | 
| 68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-09-04 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
| 01/09/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 02/09/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 03/09/12 | 100 | 100 | 99.2 | 100 | 100 | Failure to connect to srm-atlas.gridpp.rl.ac.uk | 
| 04/09/12 | 100 | 57.3 | 100 | 100 | 100 | Test fails version check on EMI2.1 nodes. | 
| 05/09/12 | 100 | 89.3 | 94.5 | 91.7 | 96.5 | Mainly problem on Tier1 Network Link causing problems for switch stack. | 
| 06/09/12 | 100 | 93.1 | 99.2 | 100 | 91.7 | Continued effect of Tier1 Network Link causing problems for switch stack. | 
| 07/09/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 08/09/12 | 100 | 100 | 99.2 | 100 | 100 | Failure to connect to srm-atlas.gridpp.rl.ac.uk correlates with network router reload. | 
| 09/09/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 10/09/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 11/09/12 | 100 | 100 | 100 | 100 | 100 |  |
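
The daily availability figures in the table above can be summarised as a mean per VO over the eleven days listed. The following is a minimal sketch that copies the values from the table; the names `DAILY_AVAILABILITY` and `mean_availability` are illustrative, and a simple unweighted mean is an assumption that may differ from how official availability figures are aggregated.

```python
from statistics import mean

VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")

# (day, OPS, Alice, Atlas, CMS, LHCb), copied from the table above.
DAILY_AVAILABILITY = [
    ("01/09/12", 100, 100, 100, 100, 100),
    ("02/09/12", 100, 100, 100, 100, 100),
    ("03/09/12", 100, 100, 99.2, 100, 100),
    ("04/09/12", 100, 57.3, 100, 100, 100),
    ("05/09/12", 100, 89.3, 94.5, 91.7, 96.5),
    ("06/09/12", 100, 93.1, 99.2, 100, 91.7),
    ("07/09/12", 100, 100, 100, 100, 100),
    ("08/09/12", 100, 100, 99.2, 100, 100),
    ("09/09/12", 100, 100, 100, 100, 100),
    ("10/09/12", 100, 100, 100, 100, 100),
    ("11/09/12", 100, 100, 100, 100, 100),
]


def mean_availability(rows):
    """Return the unweighted mean availability (percent) for each VO."""
    result = {}
    for column, vo in enumerate(VOS, start=1):
        result[vo] = mean(row[column] for row in rows)
    return result


averages = mean_availability(DAILY_AVAILABILITY)
for vo in VOS:
    print(f"{vo:5s} {averages[vo]:6.2f}")

# Days on which at least one VO was below 100 percent.
affected_days = [row[0] for row in DAILY_AVAILABILITY if min(row[1:]) < 100]
print("days with reduced availability:", ", ".join(affected_days))
```

Listing the affected days alongside the means makes it easier to match each dip to the corresponding entry in the Comment column.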