Tier1 Operations Report 2013-11-13
RAL Tier1 Operations Report for 13th November 2013
| Review of Issues during the week 6th to 13th November 2013. | 
- Services were watched closely following the work on the UPS on Tuesday/Wednesday last week. A UPS/generator load test was carried out successfully this morning.
- One batch of worker nodes has continued to give problems and has not been in production.
- One file has been reported lost to ILC. The file was found to be corrupt when investigating why it would not migrate to tape.
| Resolved Disk Server Issues | 
- None
| Current operational status and issues | 
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
| Ongoing Disk Server Issues | 
- GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
| Notable Changes made this last week. | 
- We are now running with just the one (Condor) batch farm. Nodes that were in the Torque/Maui farm when it was stopped last week have been re-configured and added to the Condor farm. The CEs that fronted the old Torque/Maui farm (lcgce01, 02, 04, 10, 11) have been set as not in production in the GOC DB.
- A UPS/generator load test was successfully carried out this morning (Wed 13th Nov). This test was scheduled following the work on the UPS last week.
| Declared in the GOC DB | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. | 
| Advanced warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
Listing by category:
-  Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
 
-  Castor:
- Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
 
-  Networking:
-  Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
 
 
-  Fabric:
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
 
| Entries in GOC DB starting between the 6th and 13th November 2013. | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| Whole Site. | SCHEDULED | WARNING | 13/11/2013 10:00 | 13/11/2013 12:00 | 2 hours | RAL site in warning state due to power generator test. | 
| CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. | 
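The Duration column above follows directly from the Start and End timestamps. As a minimal sketch (the function name `downtime_duration` and the format constant are hypothetical and not part of any GOC DB tooling), the arithmetic can be checked like this, assuming the timestamps use the day/month/year layout shown in the table:

```python
from datetime import datetime

# Assumed timestamp layout, taken from the Start/End columns in the table.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> tuple[int, int, int]:
    """Return the (days, hours, minutes) elapsed between two timestamps."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# The two entries declared for this week.
print(downtime_duration("13/11/2013 10:00", "13/11/2013 12:00"))
print(downtime_duration("05/11/2013 07:00", "30/11/2013 23:59"))
```

Both results agree with the durations listed in the table (2 hours; 25 days, 16 hours and 59 minutes).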
| Open GGUS Tickets (Snapshot at time of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
|---|---|---|---|---|---|---|---|
| 98838 | Green | Urgent | In Progress | 2013-11-13 | 2013-11-13 | T2K | no jobs delegated to cream-ce0* | 
| 98833 | Green | Less Urgent | In Progress | 2013-11-12 | 2013-11-13 | SNO+ | Adoption of backup GridPP VOMS servers: lcglb03.gridpp.rl.ac.uk | 
| 98764 | Green | Less Urgent | Waiting Reply | 2013-11-08 | 2013-11-11 | SNO+ | Storage request | 
| 98625 | Red | Urgent | In Progress | 2013-11-04 | 2013-11-12 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2 | 
| 98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 | 
| 98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO | 
| 97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org | 
| 97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-11-07 | OPS | SHA-2 test failing on lcgce01 | 
| 97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org | 
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-05-11 | | Myproxy server certificate does not contain hostname | 
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-11-13 | | LFC webdav support | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host | 
| Availability Report | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
|---|---|---|---|---|---|---|
| 06/11/13 | 46.2 | 46.2 | 0 | 100 | 46.2 | Batch not restarted until the middle of the day owing to the UPS intervention. | 
| 07/11/13 | 100 | 100 | 62.3 | 100 | 100 | Atlas remained "not available" until the 'old' CEs for the Torque/Maui batch farm were marked out of production in the GOC DB. | 
| 08/11/13 | 100 | 100 | 100 | 100 | 100 | |
| 09/11/13 | 100 | 100 | 100 | 100 | 100 | |
| 10/11/13 | 100 | 100 | 100 | 100 | 100 | |
| 11/11/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "could not open connection to srm-atlas.gridpp.rl.ac.uk" | 
| 12/11/13 | 100 | 100 | 100 | 100 | 100 | |
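
For a quick per-VO summary of the week, the daily figures in the table can be averaged. The sketch below is illustrative only: the names `AVAILABILITY` and `weekly_mean` are hypothetical, and a simple unweighted mean over the seven days is an assumption that may not match how official availability figures are aggregated.

```python
# Daily availability (%) copied from the Availability Report table.
AVAILABILITY = {
    "06/11/13": {"OPS": 46.2, "Alice": 46.2, "Atlas": 0.0, "CMS": 100.0, "LHCb": 46.2},
    "07/11/13": {"OPS": 100.0, "Alice": 100.0, "Atlas": 62.3, "CMS": 100.0, "LHCb": 100.0},
    "08/11/13": {"OPS": 100.0, "Alice": 100.0, "Atlas": 100.0, "CMS": 100.0, "LHCb": 100.0},
    "09/11/13": {"OPS": 100.0, "Alice": 100.0, "Atlas": 100.0, "CMS": 100.0, "LHCb": 100.0},
    "10/11/13": {"OPS": 100.0, "Alice": 100.0, "Atlas": 100.0, "CMS": 100.0, "LHCb": 100.0},
    "11/11/13": {"OPS": 100.0, "Alice": 100.0, "Atlas": 99.1, "CMS": 100.0, "LHCb": 100.0},
    "12/11/13": {"OPS": 100.0, "Alice": 100.0, "Atlas": 100.0, "CMS": 100.0, "LHCb": 100.0},
}


def weekly_mean(table: dict[str, dict[str, float]]) -> dict[str, float]:
    """Unweighted mean availability per VO across all days in the table."""
    vos = next(iter(table.values())).keys()
    return {vo: sum(day[vo] for day in table.values()) / len(table) for vo in vos}


for vo, mean in weekly_mean(AVAILABILITY).items():
    print(f"{vo}: {mean:.1f}%")
```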
