Tier1 Operations Report 2013-11-06
RAL Tier1 Operations Report for 6th November 2013
| Review of Issues during the week 30th October to 6th November 2013. | 
- The Torque/Maui batch continued to run until its final drain with one of the batches of worker nodes disabled.
- The significant service outages for the UPS work are described below. To keep the Top-BDII service up, two replacement nodes were installed and placed on non-UPS power; however, these have not run smoothly. The LFC, MyProxy (lcgrbp01) and FTS3 services stayed up. Some services (FTS2, LFC, Atlas 3D/Frontier) were up most of the time but suffered two short (1 to 2 hour) outages yesterday (5th). Castor was down from 07:00 to 19:00 yesterday. Batch (the CEs in front of the Condor farm) was down from 07:00 yesterday until 13:00 today.
- The "uklight" router stopped during the afternoon of Tuesday 5th November when the power to its rack failed. This router carries the links to CERN and the 'bypass' route to Tier2s. The failure did not cause any operational problems as other services were already down for the UPS work in the main computer building. It was unrelated to the planned work and purely coincidental: the uklight router is in a different building.
| Resolved Disk Server Issues | 
- None
| Current operational status and issues | 
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
- The uplink from the Tier1 core switch to the UK Light router that was doubled last week has been working OK since that change.
| Ongoing Disk Server Issues | 
- GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
| Notable Changes made this last week. | 
- There was a significant outage yesterday and this morning for work on the UPS - changes to the "Essential Power Board" and an electrical safety check during which all UPS circuits were tested.
- The Torque/Maui batch farm has been stopped. Its worker nodes will be moved into the Condor farm. The CREAM CEs that served this farm (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) are in a long downtime in the GOC DB ahead of decommissioning.
| Declared in the GOC DB | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. | 
| Advanced warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
| Entries in GOC DB starting between the 30th October and 6th November 2013. | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. | 
| All Castor (all SRMs), Atlas Frontier (lcgft-atlas.gridpp.rl.ac.uk) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 05/11/2013 19:13 | 12 hours and 13 minutes | Stop of systems (Castor, Frontier/3D database) during work on Uninterruptible Power Supply (UPS). | 
| All Batch (arc-ce01, arc-ce02, arc-ce03, cream-ce01, cream-ce02, atlas-squid, cms-squid, VO boxes, WMSs (lcgwms04, lcgwms05, lcgwms06), perfsonar (perfsonar-ps01, perfsonar-ps02)) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 06/11/2013 15:00 | 1 day, 8 hours | Stop of systems (Batch, WMS) during work on Uninterruptible Power Supply (UPS). | 
| lcgbdii, site-bdii, lcgfts, lcgrbp01, myproxy, lfc. | SCHEDULED | WARNING | 05/11/2013 07:00 | 06/11/2013 12:00 | 1 day, 5 hours | Warning (At Risk) on services during intervention on Uninterruptible Power Supply (UPS). Some services (LFC, FTS) will experience two breaks of around one to two hours during this period. | 
| All WMSs (lcgwms04, lcgwms05, lcgwms06) | SCHEDULED | OUTAGE | 01/11/2013 12:00 | 05/11/2013 07:00 | 3 days, 19 hours | Drain of WMSs ahead of their shutdown during work on UPS. | 
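
The Duration column in the tables above follows directly from the Start and End timestamps. As a minimal sketch for cross-checking such entries (the helper name `downtime_duration` is hypothetical and not part of any GOC DB tooling; the timestamp layout is assumed to be the `DD/MM/YYYY HH:MM` form shown in the tables):

```python
from datetime import datetime

# Assumed layout of the Start/End columns, e.g. "05/11/2013 07:00".
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the gap between two timestamps as 'D days, H hours and M minutes'.

    Hypothetical helper for illustration only.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days} days, {hours} hours and {minutes} minutes"


# Torque/Maui CE decommissioning entry
print(downtime_duration("05/11/2013 07:00", "30/11/2013 23:59"))
# Castor / Atlas Frontier outage entry
print(downtime_duration("05/11/2013 07:00", "05/11/2013 19:13"))
```

The sketch always prints all three units, so shorter entries appear as "0 days, 12 hours and 13 minutes" rather than in the abbreviated form the GOC DB uses.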
| Open GGUS Tickets (Snapshot at time of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
|---|---|---|---|---|---|---|---|
| 98625 | Green | Urgent | In Progress | 2013-11-04 | 2013-11-04 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2 | 
| 98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 | 
| 98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO | 
| 97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org | 
| 97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-10-04 | OPS | SHA-2 test failing on lcgce01 | 
| 97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org | 
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-05-11 | | Myproxy server certificate does not contain hostname | 
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host | 
| Availability Report | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
|---|---|---|---|---|---|---|
| 30/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 31/10/13 | 100 | 100 | 100 | 97.7 | 100 | Single SRM test failure (on Put) "Error reading token data header:" | 
| 01/11/13 | 100 | 100 | 99.0 | 100 | 100 | Provoked by local Atlas file deletions taking place at the same time. | 
| 02/11/13 | 100 | 100 | 100 | 100 | 100 | |
| 03/11/13 | 100 | 100 | 100 | 100 | 100 | |
| 04/11/13 | 100 | 85.0 | 71.3 | 100 | 74.6 | Mainly drain of batch ahead of tomorrow's UPS work. Atlas also had a single SRM SUM test failure. | 
| 05/11/13 | 29.2 | 0 | 0 | 48.9 | 0 | Effect of outage for UPS Work in R89. | 
