Latest revision as of 12:05, 24 October 2012
RAL Tier1 Operations Report for 24th October 2012
| Review of Issues during the week 17th to 24th October 2012 | 
-  During planned maintenance the OPN link to CERN failed over to the backup route from around 07:30 until 17:30 on Saturday 20th October.
-  While the LHCb disk servers were being rebooted during the Castor instance upgrade, one of them re-installed itself as another disk server. No data was lost, but the server was out of production until later that afternoon; a further fault was then found and fixed the following morning.
-  During the afternoon of Tuesday 23rd Oct. one of the LHCb Castor headnodes showed a significant hardware fault and was replaced.
-  The FTS service failed (with a known bug) early yesterday evening (23rd Oct). The test for this failed to detect the problem and the service was down for most VOs until around 9am this morning (24th).
| Resolved Disk Server Issues | 
-  GDSS454 (AtlasDataDisk - D1T0) failed on 16th Oct. It was returned to production during the afternoon of 17th October. As reported at the last meeting one file was declared lost from this server.
-  GDSS639 (GENScratchDisk - D0T0) failed on Saturday morning (20th Oct). It was returned to production on Monday afternoon (22nd Oct) after faulty memory had been replaced.
-  GDSS213 (AtlasScratchDisk - D1T0) failed on Sunday afternoon (21st Oct). It was returned to production on Monday afternoon (22nd Oct).
-  GDSS535 (LHCbDst - D1T0) The system was re-installed as another node when rebooted during the LHCb Castor upgrade on Tuesday 23rd Oct. It was returned to production later that afternoon. However, a further problem was found on this server which was fixed during the following morning (24th).
| Current operational status and issues | 
-  At the moment we are failing the VO SUM tests for the CEs for a number of VOs. This reflects tests that have not yet moved to the new EMI CEs.
-  On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
-  High load observed on the uplink to one of the network stacks (stack 13), serving SL09 disk servers (~3PB of storage). Ongoing work by the Fabric team is looking to improve the uplink.
-  Investigations are ongoing (using perfsonar) into asymmetric routing of data over (and not back over) the OPN. A problem has been resolved with routing from CNAF. The problem also appears with the North American Tier1 sites and is being followed up.
| Ongoing Disk Server Issues | 
| Notable Changes made this last week | 
-  WMS01 updated to EMI v3.3.8 
-  On 19th Oct an update to the castor information provider removed some unnecessary references to glite and fixed a problem of tape usage reporting.
-  23rd Oct - LHCb Castor instance upgraded to version 2.1.12-10.
-  23rd October glite CREAM CEs replaced with EMI CREAM CEs.
-  Hyperthreading continues to run on one batch of worker nodes ahead of it being rolled out on all suitable worker nodes.
-  As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
-  A test queue ("gridTest") is available with (currently) four worker nodes running EMI2/SL5. In addition a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
-  Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
-  Ongoing WMS02 update to EMI v3.3.8
-  Tuesday 30th October: Upgrade of GEN  Castor instance to Version 2.1.12-10. 
| Advance warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
-  20th November: Intervention required on the "Essential Power Board" and transformers. (An "At Risk").
Listing by category:
-  Databases:
-  Switch LFC/FTS/3D to new Database Infrastructure.
 
-  Castor:
-  Upgrade to version 2.1.12. (As detailed above).
 
-  Networking:
-  Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKlight network).
-  Update Spine layer for Tier1 network.
-  Replacement of UKLight Router.
-  Addition of caching DNSs into the Tier1 network.
 
-  Grid Services:
-  CEs being upgraded to EMI version now.
-  Rolling upgrade of WMSs to EMI version underway.
-  Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).
 
Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
-  Infrastructure:
-  Intervention required on the "Essential Power Board".
-  Remedial work on three (out of four) transformers.
-  Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
 
| Entries in GOC DB starting between 17th and 24th October 2012 | 
There are two unscheduled outages in the GOC DB for this period. One is for the failure of one of the LHCb Castor headnodes, the other is for the new EMI CREAM CEs (not in production at that time).
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
| srm-lhcb | UNSCHEDULED | WARNING | 23/10/2012 16:30 | 24/10/2012 12:30 | 20 hours | At risk due to hardware fault on castor headnode. Services are being moved to alternative hardware. | 
| lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | WARNING | 23/10/2012 10:00 | 24/10/2012 12:00 | 1 day, 2 hours | post EMI-2 CREAM migration | 
| lcgce03, lcgce05, lcgce07, lcgce08, lcgce09 | SCHEDULED | OUTAGE | 23/10/2012 09:00 | 30/11/2012 12:00 | 38 days, 4 hours | replacement with EMI-2 CREAM nodes | 
| srm-lhcb | SCHEDULED | OUTAGE | 23/10/2012 08:00 | 23/10/2012 10:50 | 2 hours and 50 minutes | Upgrade of LHCb Castor instance to Version 2.1.12-10 | 
| lcgwms02 | SCHEDULED | OUTAGE | 21/10/2012 10:00 | 26/10/2012 13:00 | 5 days, 3 hours | EMI WMS upgrade to v3.3.8 | 
| lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | UNSCHEDULED | OUTAGE | 19/10/2012 15:00 | 23/10/2012 10:00 | 3 days, 19 hours | migration to EMI-2 CREAM | 
| lcgwms01 | SCHEDULED | OUTAGE | 19/10/2012 13:00 | 22/10/2012 15:00 | 3 days, 2 hours | EMI WMS upgrade to v3.3.8 | 
| lcgwms01 | SCHEDULED | OUTAGE | 17/10/2012 15:00 | 19/10/2012 13:00 | 1 day, 22 hours | EMI WMS update to v3.3.8 | 
| lcgwms01 | SCHEDULED | OUTAGE | 12/10/2012 10:00 | 17/10/2012 15:00 | 5 days, 5 hours | EMI WMS update to v3.3.8 | 
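
The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch, assuming the timestamps are UK local wall-clock times (the report does not state the time zone); the function name `outage_duration` and the constants are illustrative only and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: Start/End are UK local wall-clock times (not stated in the report).
UK_LOCAL = ZoneInfo("Europe/London")
TIME_FORMAT = "%d/%m/%Y %H:%M"  # matches e.g. "23/10/2012 09:00"


def outage_duration(start: str, end: str, tz: ZoneInfo = UK_LOCAL) -> timedelta:
    """Return the elapsed time between two timestamps from the table."""
    # Attach the zone, then convert to UTC before subtracting, so that the
    # result is elapsed time rather than a wall-clock difference.
    start_utc = datetime.strptime(start, TIME_FORMAT).replace(tzinfo=tz).astimezone(timezone.utc)
    end_utc = datetime.strptime(end, TIME_FORMAT).replace(tzinfo=tz).astimezone(timezone.utc)
    return end_utc - start_utc


if __name__ == "__main__":
    # srm-lhcb "At Risk" row
    print(outage_duration("23/10/2012 16:30", "24/10/2012 12:30"))
    # CE replacement row, which spans the end of British Summer Time
    print(outage_duration("23/10/2012 09:00", "30/11/2012 12:00"))
```

Under that assumption the CE replacement row comes out at 38 days, 4 hours, as listed, rather than the 38 days, 3 hours that a plain wall-clock subtraction would give: UK clocks went back one hour on 28 October 2012, which falls inside that interval.
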
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
| 86705 | Red | Less Urgent | In Progress | 2012-10-03 | 2012-10-23 | SNO+ | RAL jobs returning errors | 
| 86690 | Red | Urgent | In Progress | 2012-10-03 | 2012-10-22 | T2K | JPKEKCRC02 missing from FTS ganglia metrics | 
| 86152 | Red | Less Urgent | In Progress | 2012-09-17 | 2012-10-22 |  | correlated packet-loss on perfsonar host | 
| 68853 | Red | Less Urgent | In Progress | 2011-03-22 | 2012-10-23 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
| 17/10/12 | 96.0 | 100 | 100 | 100 | 100 | CE07 had a problem (according to tests). This coincided with a block of missing data. | 
| 18/10/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 19/10/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 20/10/12 | 100 | 100 | 99.1 | 100 | 100 | Single failure of SRM Put at 07:46 ("zero number of replicas") | 
| 21/10/12 | 100 | 100 | 98.2 | 100 | 100 | Failures of SRM Get at 02:05 & 02:19 ("could not open connection to srm-atlas.gridpp.rl.ac.uk") | 
| 22/10/12 | 100 | 100 | 100 | 100 | 100 |  | 
| 23/10/12 | 92.6 | 33.3 | 33.3 | 82.0 | 29.2 | Mainly effect of replacing glite CREAM CEs with EMI CREAM CEs. Some effect on LHCb from castor upgrade. | 