Latest revision as of 10:54, 30 October 2013, by Gareth Smith
RAL Tier1 Operations Report for 30th October 2013
| Review of Issues during the week 23rd to 30th October 2013. | 
- The Torque/Maui farm still has one batch of worker nodes disabled. Apart from that it has run reasonably well. The Condor farm has run OK.
- Two files were declared lost to Atlas following the failure of GDSS720. These were in transit as the server went down.
| Resolved Disk Server Issues | 
- None
| Current operational status and issues | 
- FTS3 testing with Atlas has continued very actively. The tests are uncovering problems with FTS3, and patches are regularly applied to address them.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
- We are running with the two farms, Condor and Torque/Maui, in production. The Torque/Maui farm will be decommissioned after the intervention next week and its nodes moved into the Condor farm.
- The uplink from the Tier1 core switch to the UK Light router that was doubled last week has been working OK since that change.
| Ongoing Disk Server Issues | 
- GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
| Notable Changes made this last week. | 
- CVMFS client version 2.1.15-1 has been rolled out to all worker nodes in the Condor farm.
- A further update was applied to FTS3 last Wednesday, 23rd October (upgraded to 3.1.33-1).
| Declared in the GOC DB | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| BDIIs (lcgbdii, site-bdii), lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, Myproxy (lcgrbp01, myproxy) | SCHEDULED | WARNING | 05/11/2013 07:00 | 06/11/2013 12:00 | 1 day, 5 hours | Warning (At Risk) on services during intervention on Uninterruptible Power Supply (UPS). Some services (LFC, FTS) will experience two breaks of around one to two hours during this period. | 
| All Castor (all SRMs), Atlas Frontier | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 05/11/2013 21:00 | 14 hours | Stop of systems (Castor, Frontier/3D database) during work on Uninterruptible Power Supply (UPS). | 
| Condor batch farm (arc-ce01, arc-ce02, arc-ce03, cream-ce01, cream-ce02), lcgargus01, VO boxes, lcgapel01, atlas-squid, cms-squid, UIs (lcgui01, lcgui02), WMSs (lcgwms04, lcgwms05, lcgwms06), Perfsonar (perfsonar-ps01, perfsonar-ps02) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 06/11/2013 15:00 | 1 day, 8 hours | Stop of systems (Batch, WMS) during work on Uninterruptible Power Supply (UPS). | 
| lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. | 
| lcgwms04, lcgwms05, lcgwms06 | SCHEDULED | OUTAGE | 01/11/2013 12:00 | 05/11/2013 07:00 | 3 days, 19 hours | Drain of WMSs ahead of their shutdown during work on UPS. | 
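The Duration column above follows directly from the Start and End timestamps. As a minimal sketch for cross-checking those figures, assuming the day/month/year timestamp layout used in the table (the function name `downtime_duration` is hypothetical and not part of any GOC DB tooling):

```python
from datetime import datetime

# Timestamp layout assumed from the table above: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as readable text."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    days = delta.days
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            # Add a plural "s" for any value other than 1
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts)


# Example: the UPS "At Risk" entry from the table above
print(downtime_duration("05/11/2013 07:00", "06/11/2013 12:00"))
```

The sketch joins every part with a comma, so its wording differs slightly from GOC DB entries such as "25 days, 16 hours and 59 minutes", although the values agree.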
| Advanced warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
- Interruption to services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits. Outages and Warnings declared in GOC DB.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - None
- Networking:
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
 
| Entries in GOC DB starting between the 23rd and 30th October 2013. | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| All Castor (all SRMs), batch (All CEs), lcgfts, lfc | SCHEDULED | OUTAGE | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network so some services stopped during the work. Other services at risk. | 
| All systems not in the above outage. | SCHEDULED | WARNING | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network - some services At Risk. (Other services declared down in separate GOC DB entry). | 
| Open GGUS Tickets (Snapshot at time of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
|---|---|---|---|---|---|---|---|
| 98337 | Amber | Urgent | In Progress | 2013-10-23 | 2013-10-23 | Mice | Slow file uploads to castor (MICE) | 
| 98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 | 
| 98214 | Red | Less Urgent | In Progress | 2013-10-19 | 2013-10-21 | CMS | HC Job failure reading dataset from T1_UK_RAL storage | 
| 98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO | 
| 97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org | 
| 97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-10-04 | OPS | SHA-2 test failing on lcgce01 | 
| 97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org | 
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname | 
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host | 
| Availability Report | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
|---|---|---|---|---|---|---|
| 23/10/13 | 89.6 | 89.6 | 87.4 | 89.6 | 89.6 | Systems stopped for doubling of data uplink. | 
| 24/10/13 | 100 | 100 | 85.9 | 100 | 100 | Atlas Castor problem caused by a draining disk server. | 
| 25/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 26/10/13 | 100 | 100 | 99.5 | 100 | 100 | Single SRM test failure "Error reading token data header:" | 
| 27/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 28/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 29/10/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "Error reading token data header:" | 
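
For a quick weekly summary, the daily figures above can be averaged per VO. This is only a sketch: it takes a simple unweighted mean of the seven daily percentages, which is an assumption and not necessarily how official availability figures are aggregated. The names `DAILY_AVAILABILITY` and `weekly_mean` are hypothetical.

```python
# Daily availability (%) copied from the table above, 23/10/13 to 29/10/13
DAILY_AVAILABILITY = {
    "OPS":   [89.6, 100, 100, 100, 100, 100, 100],
    "Alice": [89.6, 100, 100, 100, 100, 100, 100],
    "Atlas": [87.4, 85.9, 100, 99.5, 100, 100, 100],
    "CMS":   [89.6, 100, 100, 100, 100, 100, 95.9],
    "LHCb":  [89.6, 100, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages (an assumed summary method)."""
    return sum(values) / len(values)


for vo, values in DAILY_AVAILABILITY.items():
    print(f"{vo}: {weekly_mean(values):.1f}")
```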
