RAL Tier1 Operations Report for 2nd October 2013
| Review of Issues during the fortnight 18th September to 2nd October 2013. | 
-  As reported at the last meeting: There was a problem in the Oracle database behind Castor overnight 17/18 Sep. This was a known problem, and the fix was to change an Oracle parameter. A Castor outage, with batch work stopped/paused, was carried out on Wednesday 18th to pick up the changed parameter.
-  On Thursday (19th) there was a problem on some batch worker nodes with /tmp filling up and the nodes being set offline. This was traced to some ALICE jobs. Good communications with ALICE enabled a prompt resolution.
-  On Saturday (28th Sep) a fault in a CVMFS repository at CERN caused problems, in particular with /cvmfs/lhcb-conddb. This caused the Condor farm's healthcheck to fail, and for a couple of hours (until this test was removed) Condor batch jobs did not start.
-  There have been some problems with the Torque/Maui batch server. This affected Alice test jobs on Saturday (28th). The problem was resolved on Tuesday (1st Oct) when some stuck batch jobs were cleaned up.
| Resolved Disk Server Issues | 
-  GDSS670 (AliceDisk - D1T0) failed on Sunday (22nd Sep). A RAID verify failed owing to a faulty disk. It was returned to service the next day.
-  GDSS595 (GenTape - D0T1) was unavailable for a couple of hours on Monday (23rd Sep). The system needed to be restarted as it wouldn't see a replacement disk drive.
-  GDSS611 (LHCbDst - D1T0) failed on Tuesday (24th Sep). The disk controller was reporting many failed drives. The system was returned to service on Thursday (26th Sep) and has been drained for further investigation.
| Current operational status and issues | 
-  The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
-  The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues. 
-  We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
-  The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity, and all its WNs are running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs will be upgraded to SL6 tomorrow. That farm is expected to restart tomorrow, initially with around 20% of the total batch capacity, with the remaining nodes being added over the following days.
-  Just before this meeting a problem arose that is believed to be in a network switch stack. We lost access to some older servers (including those in Atlas HotDisk and some older worker nodes). Investigations are ongoing.
| Ongoing Disk Server Issues | 
-  GDSS673 (CMSDisk - D1T0) is out of production. It failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was returned to service on Monday (30th Sep). However, it failed again when another disk failed while it was still rebuilding the RAID array.
| Notable Changes made this last fortnight. | 
-  The SRMs have been upgraded to SL5.9 with updated errata and kernel (lcgsrm13 on Monday 23rd, the remaining Atlas SRMs on Wed. 25th, all others on Thursday 26th).
-  FTS3 was upgraded to version 3.1.14-1 on Wed 25th and to version 3.1.16-1 the next day.
-  On Thursday 26th the upgrade of all three Top-BDII nodes (lcgbdii01, lcgbdii03, lcgbdii04) to the latest EMI v3.8.0 release was completed.
-  The BDII component on a number of systems (including ARC and CREAM CEs for the Condor farm) has been upgraded.
-  Condor farm CEs set to Production in the GOC DB on Monday (30th). At this point the Condor farm had around 50% of total batch capacity.
-  Tuesday 1st October: Update to Janet6 infrastructure for the RAL site connections and the backup OPN link to CERN.
-  LCGCE01,02,04,10,11 (The Torque/Maui farm) in an Outage as the WNs are upgraded to SL6.
| Advance warning for other interventions | 
| The following items are being discussed and are still to be formally scheduled and announced. | 
-  Tomorrow (Thursday 3rd October) - upgrade of the Torque/Maui batch farm WNs to SL6. Drain already underway.
-  Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
-  On Tuesday 8th October the primary RAL OPN link to CERN will migrate to the SuperJanet 6 infrastructure.
-  Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
-  Databases:
   -  Switch LFC/FTS/3D to new Database Infrastructure.
-  Castor:
-  Networking:
   -  Single link to UKLight Router to be restored as paired (2*10Gbit) link.
   -  Update core Tier1 network and change connection to site and OPN including:
      -  Install new Routing layer for Tier1.
      -  Change the way the Tier1 connects to the RAL network.
      -  These changes will lead to the removal of the UKLight Router.
-  Grid Services:
   -  Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
-  Fabric:
   -  One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
-  Infrastructure:
   -  A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this work the following issues will be addressed:
      -  Intervention required on the "Essential Power Board".
      -  Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      -  Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
| Entries in GOC DB starting between the 18th September and 2nd October 2013. | 
There was one unscheduled outage in the GOC DB for this period. This was for the stop of Castor (and batch) following a database problem.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason | 
| All Castor (all SRM end points) and all batch (All CEs) | UNSCHEDULED | OUTAGE | 18/09/2013 11:45 | 18/09/2013 14:45 | 3 hours | We are seeing errors in the database behind Castor. A Castor restart will be done to fix this. Will also pause batch jobs during this time. | 
| lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/09/2013 13:00 | 11/09/2013 10:55 | 4 days, 21 hours and 55 minutes | Upgrade to EMI-3 | 
| lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/09/2013 13:00 | 04/10/2013 13:00 | 29 days | CE (and the SL6 batch queue behind it) being decommissioned. | 
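The Duration column above can be cross-checked from the Start and End timestamps. The following is a minimal sketch, assuming the DD/MM/YYYY HH:MM format shown in the table; the helper names `outage_duration` and `describe` are hypothetical and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the Start/End columns of the table.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two table timestamps."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


def describe(delta: timedelta) -> str:
    """Render a duration as days, hours and minutes."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours and {minutes} minutes"


if __name__ == "__main__":
    # Start/End pairs copied from the three rows of the table.
    rows = [
        ("All Castor and all batch", "18/09/2013 11:45", "18/09/2013 14:45"),
        ("lcgwms05.gridpp.rl.ac.uk", "06/09/2013 13:00", "11/09/2013 10:55"),
        ("lcgce12.gridpp.rl.ac.uk", "05/09/2013 13:00", "04/10/2013 13:00"),
    ]
    for service, start, end in rows:
        print(f"{service}: {describe(outage_duration(start, end))}")
```

Running the script prints one line per outage, which can be compared by eye with the Duration column.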
| Open GGUS Tickets (Snapshot at time of meeting) | 
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject | 
| 97516 | Red | Urgent | Waiting Reply | 2013-09-23 | 2013-09-30 | T2K | [SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors. | 
| 97479 | Red | Very Urgent | On Hold | 2013-09-20 | 2013-09-30 | Atlas | RAL-LCG2, high job failure rate | 
| 97385 | Red | Less Urgent | On Hold | 2013-09-17 | 2013-09-26 | HyperK | CVMFS for hyperk.org | 
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 |  | Myproxy server certificate does not contain hostname | 
| 95996 | Red | Urgent | On Hold | 2013-07-22 | 2013-09-17 | OPS | SHA-2 test failing on lcgce01 | 
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 |  | LFC webdav support | 
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 |  | correlated packet-loss on perfsonar host | 
| Availability Report | 
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment | 
| 18/09/13 | 97.2 | 100 | 90.6 | 93.2 | 93.0 | Castor stop for DB restart caused test failures across all VOs (except Alice). Atlas also had a couple of other SRM SUM test failures. | 
| 19/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 20/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 21/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 22/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 23/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 24/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 25/09/13 | 100 | 100 | 98.1 | 100 | 100 | Failed one SUM test during an SRM upgrade, then another (Delete) failure overnight. | 
| 26/09/13 | 100 | 100 | 100 | 95.8 | 100 | Single SRM test failure as SRMs were updated. | 
| 27/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 28/09/13 | 100 | 95.8 | 100 | 100 | 100 | Batch problem (Cannot connect to batch server). | 
| 29/09/13 | 100 | 100 | 100 | 100 | 100 |  | 
| 30/09/13 | 100 | 100 | 100 | 95.7 | 100 | Single SRM test failure on PUT: Error reading token data header: | 
| 01/10/13 | 100 | 100 | 74.8 | 95.8 | 95.8 | Atlas problem affected all sites; Single test failure for LHCb during Janet6 transition; Single test failure for CMS (timeout). |