Tier1 Operations Report 2014-01-15
RAL Tier1 Operations Report for 15th January 2014
| Review of Issues during the week 8th to 15th January 2014. | 
- On Monday afternoon (13th Jan) a minor operational problem led to the Castor Atlas instance being down for around 15 minutes, from 15:00 to 15:15.
- The Atlas file renaming in Castor has been completed, with around 17 million files renamed. We are still checking the missing files. However, the total number of lost files found in this process is believed to be in line with other sites.
| Resolved Disk Server Issues | 
- None.
| Current operational status and issues | 
- None
| Ongoing Disk Server Issues | 
- None
| Notable Changes made this last week. | 
- Changed garbage collection threshold on all Castor D1T0 disk servers from 95% to 99%. This should lead to a 4% increase in usable space for the instance. The change was made for AtlasDataDisk on Monday (13th Jan) and all other instances on Tuesday (14th Jan).
- New CernVM-FS Stratum-0 and Stratum-1 services for the non-LHC VOs have been deployed and announced.
- The second (and final) tranche of disk servers in this year's purchase is currently being delivered.
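
The space gain from the garbage collection threshold change above can be sketched as simple arithmetic. This is only an illustration, assuming the threshold is the fraction of each disk server's capacity that may be filled before garbage collection starts; the function name `usable_space_gain` is hypothetical and not part of Castor.

```python
def usable_space_gain(old_threshold_pct, new_threshold_pct):
    """Return (percentage-point gain, relative gain in percent).

    Assumption: the threshold is the share of total capacity that
    can hold data before garbage collection starts.
    """
    point_gain = new_threshold_pct - old_threshold_pct
    # Relative gain is measured against the space usable before the change.
    relative_gain = 100.0 * point_gain / old_threshold_pct
    return point_gain, relative_gain


points, relative = usable_space_gain(95, 99)
print(f"{points} percentage points of capacity, {relative:.1f}% more usable space")
```

Under that assumption, the 4% quoted above corresponds to 4 percentage points of total capacity, which is slightly more than 4% when measured relative to the previously usable space.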
| Declared in the GOC DB | 
- There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
| Advance warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
- On Thursday 16th January the disk caches in front of the Alice-Tape and Gen-tape pools will be merged.
- On the morning (09:00 - 13:00) of Tuesday 21st January there will be an upgrade to the microcode in the tape libraries. There will be no tape access during this time.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
- Networking:
  - Implementation of new site firewall. (The date for Tier1 traffic to start using this is not yet agreed. Initial changes for links that do not affect the Tier1 commence on 20th January.)
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
| Entries in GOC DB starting between the 8th and 15th January 2014. | 
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
|---|---|---|---|---|---|---|
| lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | Old EMI-2 hosts to be retired | 
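
The duration listed for this entry can be cross-checked from its start and end timestamps. The following is a minimal sketch using only the Python standard library; the helper name `downtime_duration` is hypothetical, and the format string assumes the day/month/year layout used in the table above.

```python
from datetime import datetime

# Assumed timestamp layout, matching the Start/End columns: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the downtime length as (whole days, remaining whole hours)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.days, delta.seconds // 3600


days, hours = downtime_duration("18/12/2013 11:00", "31/01/2014 00:00")
print(f"{days} days, {hours} hours")
```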
| Open GGUS Tickets (Snapshot during morning of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
|---|---|---|---|---|---|---|---|
| 100180 | Green | Less Urgent | Waiting Reply | 2014-01-10 | 2014-01-10 | Hone | hone jobs submitted through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk into all Lyon's, all Imperial College's, 3 from 5 DESY-HH's, EFDA's and ITEP's cream queues are aborted immediately | 
| 100114 | Amber | Less Urgent | Waiting Reply | 2014-01-08 | 2014-01-10 | | Jobs failing to get from RAL WMS to Imperial | 
| 100086 | Red | Less Urgent | In Progress | 2014-01-07 | 2014-01-13 | T2K | WMS jobs cleared too rapidly | 
| 99768 | Red | Less Urgent | Waiting Reply | 2013-12-13 | 2014-01-07 | Atlas | RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist" | 
| 99647 | Red | Less Urgent | In Progress | 2013-12-12 | 2013-12-17 | SNO+ | lcg-cp connection timeouts | 
| 99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-07 | | NGI Argus requests for NGI_UK | 
| 98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-01-14 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 | 
| 98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2014-01-14 | cernatschool | CVMFS access for the cernatschool.org VO | 
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-01-06 | | Myproxy server certificate does not contain hostname | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host | 
| Availability Report | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
|---|---|---|---|---|---|---|
| 08/01/14 | 100 | 100 | 100 | 100 | 100 | |
| 09/01/14 | 100 | 100 | 100 | 100 | 100 | |
| 10/01/14 | 100 | 100 | 100 | 100 | 100 | |
| 11/01/14 | 100 | 100 | 100 | 95.5 | 100 | WMS at CERN found "no compatible resources". | 
| 12/01/14 | 100 | 100 | 100 | 100 | 100 | |
| 13/01/14 | 100 | 100 | 99.2 | 100 | 100 | Short outage of a Castor daemon. | 
| 14/01/14 | 100 | 100 | 100 | 95.9 | 100 | SRM Put test failure "Invalid argument". | 
