RAL Tier1 Operations Report for 2nd October 2013
| Review of Issues during the fortnight 18th September to 2nd October 2013.
|
- As reported at the last meeting, there was a problem in the Oracle database behind Castor overnight on 17/18 Sep. This was a known problem and the fix was to change an Oracle parameter. A Castor outage, with a batch stop/pause, was carried out on Wednesday 18th in order to pick up the changed parameter.
- On Thursday (19th) there was a problem on some batch worker nodes with /tmp filling up and the nodes being set offline. This was traced to some ALICE jobs. Good communications with ALICE enabled a prompt resolution.
- On Saturday (28th Sep) a fault in a CVMFS repository at CERN, in particular /cvmfs/lhcb-conddb, caused the Condor farm's healthcheck to fail. Condor batch jobs did not start for a couple of hours, until this test was removed from the healthcheck (a sketch of a healthcheck of this kind, covering both the /tmp and CVMFS checks, follows this list).
- There have been some problems with the Torque/Maui batch server. This affected Alice test jobs on Saturday (28th). The problem was resolved on Tuesday (1st Oct) when some stuck batch jobs were cleaned up.
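The sketch below illustrates, in outline only, the sort of worker-node healthcheck referred to in the /tmp and CVMFS items above. The free-space threshold, the repository list and the way a failure is reported to the batch system are illustrative assumptions, not the production configuration.

  #!/usr/bin/env python
  # Illustrative worker-node healthcheck (not the production script).
  # A node is reported unhealthy if /tmp is nearly full or if a CVMFS
  # repository cannot be listed; the batch system's health-check hook
  # would then take the node offline.
  import os
  import sys

  TMP_MIN_FREE_MB = 1024                      # assumed threshold for /tmp
  CVMFS_PATHS = ["/cvmfs/lhcb-conddb"]        # repositories to probe (illustrative)

  def tmp_free_mb(path="/tmp"):
      """Free space (MB) on the filesystem holding 'path'."""
      st = os.statvfs(path)
      return (st.f_bavail * st.f_frsize) / (1024 * 1024)

  def cvmfs_ok(path):
      """Treat a repository as healthy if its root directory lists."""
      try:
          os.listdir(path)
          return True
      except OSError:
          return False

  def main():
      problems = []
      if tmp_free_mb() < TMP_MIN_FREE_MB:
          problems.append("/tmp nearly full")
      for path in CVMFS_PATHS:
          if not cvmfs_ok(path):
              problems.append("CVMFS repository %s unavailable" % path)
      if problems:
          print("UNHEALTHY: " + "; ".join(problems))
          sys.exit(1)
      print("HEALTHY")

  if __name__ == "__main__":
      main()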
| Resolved Disk Server Issues
|
- GDSS670 (AliceDisk - D1T0) failed on Sunday (22nd Sep). A RAID verify failed owing to a faulty disk. It was returned to service the next day.
- GDSS595 (GenTape - D0T1) was unavailable for a couple of hours on Monday (23rd Sep). The system needed to be restarted as it would not recognise a replacement disk drive.
- GDSS611 (LHCbDst - D1T0) failed on Tuesday (24th Sep). The disk controller was reporting many failed drives. The system was returned to service on Thursday (26th Sep) and has been drained for further investigation.
| Current operational status and issues
|
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- FTS3 testing with Atlas has continued very actively. These tests are uncovering problems in FTS3, and patches are being applied regularly to address them.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only act as a proxy (a sketch of the kind of read test involved follows this list).
- The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity and all of its WNs run SL6. The remaining nodes are in the Torque/Maui farm, whose WNs will be upgraded to SL6 tomorrow. That farm is expected to restart tomorrow with initially around 20% of the total batch capacity, with the remaining nodes added over the following days.
- Just before this meeting we had a problem that is believed to be in a network switch stack. We lost access to some older servers (including those in Atlas HotDisk) and some older worker nodes. Investigations are ongoing.
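As a footnote to the xrootd item above, the following is a minimal sketch of a federated-read check of the kind used in such tests: copy a file through a redirector and report success or failure. The redirector host and the file path are placeholders, not the real Atlas federation endpoints.

  #!/usr/bin/env python
  # Minimal federated-access read test: copy a file via an xrootd
  # redirector to /dev/null using xrdcp. Host and path are placeholders.
  import subprocess
  import sys

  REDIRECTOR = "xrootd-redirector.example.ac.uk"   # hypothetical redirector
  TEST_FILE = "/atlas/some/test/file.root"         # hypothetical file path

  def federated_read_ok(redirector, path):
      """Return True if the file can be read through the redirector."""
      url = "root://%s/%s" % (redirector, path)
      return subprocess.call(["xrdcp", "-f", url, "/dev/null"]) == 0

  if __name__ == "__main__":
      ok = federated_read_ok(REDIRECTOR, TEST_FILE)
      print("federated read %s" % ("OK" if ok else "FAILED"))
      sys.exit(0 if ok else 1)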
| Ongoing Disk Server Issues
|
- GDSS673 (CMSDisk - D1T0) is out of production. It failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was returned to service on Monday (30th Sep). However, it failed again when another disk failed while it was still rebuilding the RAID array.
| Notable Changes made this last fortnight.
|
- The SRMs have been upgraded to SL5.9 with updated errata and kernel (lcgsrm13 on Monday 23rd, the remaining Atlas SRMs on Wed. 25th, all others on Thursday 26th).
- FTS3 was upgraded to version 3.1.14-1 on Wed 25th and to version 3.1.16-1 the next day.
- On Thursday 26th the upgrade of all three Top-BDII nodes (lcgbdii01, lcgbdii03, lcgbdii04) to the latest EMI release (v3.8.0) was completed.
- The BDII component on a number of systems (including the ARC and CREAM CEs for the Condor farm) has been upgraded (a query sketch checking what is published is given after this list).
- Condor farm CEs set to Production in the GOC DB on Monday (30th). At this point the Condor farm had around 50% of total batch capacity.
- Tuesday 1st October: Update to Janet6 infrastructure for the RAL site connections and the backup OPN link to CERN.
- LCGCE01, 02, 04, 10 and 11 (the Torque/Maui farm) are in an Outage while their WNs are upgraded to SL6.
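Relating to the BDII items above, the sketch below shows one way of checking what the upgraded Top-BDII publishes for our CEs. It is illustrative only: it assumes GLUE 1.3 publishing and the python-ldap module, and the search filter is not the exact operational check.

  #!/usr/bin/env python
  # Query a Top-BDII for the CEs published under gridpp.rl.ac.uk and print
  # their state (assumes GLUE 1.3 schema and the python-ldap module).
  import ldap

  TOP_BDII = "ldap://lcgbdii01.gridpp.rl.ac.uk:2170"   # one of the upgraded nodes

  def list_ral_ces():
      conn = ldap.initialize(TOP_BDII)
      conn.simple_bind_s()                             # anonymous bind
      results = conn.search_s(
          "o=grid",
          ldap.SCOPE_SUBTREE,
          "(&(objectClass=GlueCE)(GlueCEUniqueID=*gridpp.rl.ac.uk*))",
          ["GlueCEUniqueID", "GlueCEStateStatus"],
      )
      for _dn, attrs in results:
          ce = attrs.get("GlueCEUniqueID", ["?"])[0]
          status = attrs.get("GlueCEStateStatus", ["?"])[0]
          print("%s : %s" % (ce, status))

  if __name__ == "__main__":
      list_ral_ces()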
| Advanced warning for other interventions
|
| The following items are being discussed and are still to be formally scheduled and announced.
|
- Tomorrow (Thursday 3rd October) - upgrade of the Torque/Maui batch farm WNs to SL6. Drain already underway.
- Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
- On Tuesday 8th October the primary RAL OPN link to CERN will migrate to the SuperJanet 6 infrastructure.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Grid Services:
  - Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
- Infrastructure:
  - A 2-day maintenance on the UPS, along with the safety testing of associated electrical circuits, is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this maintenance the following issues will be addressed:
    - Intervention required on the "Essential Power Board".
    - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    - Electrical safety check. This will take place over a couple of days, during which time individual UPS circuits will need to be powered down.
| Entries in GOC DB starting between the 18th September and 2nd October 2013.
|
There was one unscheduled outage in the GOC DB for this period, covering the stop of Castor (and batch) following a database problem.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| All Castor (all SRM end points) and all batch (All CEs) | UNSCHEDULED | OUTAGE | 18/09/2013 11:45 | 18/09/2013 14:45 | 3 hours | We are seeing errors in the database behind Castor. A Castor restart will be done to fix this. Will also pause batch jobs during this time. |
| lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/09/2013 13:00 | 11/09/2013 10:55 | 4 days, 21 hours and 55 minutes | Upgrade to EMI-3 |
| lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/09/2013 13:00 | 04/10/2013 13:00 | 29 days | CE (and the SL6 batch queue behind it) being decommissioned. |
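For reference, the downtime entries above can also be pulled from the GOC DB programmatic interface rather than read off the web pages. The sketch below is illustrative only: the startdate/enddate parameter names and the DOWNTIME element name are assumptions about the public PI and should be checked against its documentation.

  #!/usr/bin/env python
  # Fetch downtimes for RAL-LCG2 from the GOC DB programmatic interface
  # and print the fields of each entry. Parameter and element names are
  # assumptions; see the GOC DB PI documentation.
  import urllib.request
  import xml.etree.ElementTree as ET

  URL = ("https://goc.egi.eu/gocdbpi/public/"
         "?method=get_downtime&topentity=RAL-LCG2"
         "&startdate=2013-09-18&enddate=2013-10-02")   # assumed parameter names

  def print_downtimes(url=URL):
      with urllib.request.urlopen(url) as resp:
          root = ET.parse(resp).getroot()
      for downtime in root.iter("DOWNTIME"):
          fields = ["%s=%s" % (child.tag, (child.text or "").strip())
                    for child in downtime]
          print("; ".join(fields))

  if __name__ == "__main__":
      print_downtimes()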
| Open GGUS Tickets (Snapshot at time of meeting)
|
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
| 97516 | Red | Urgent | Waiting Reply | 2013-09-23 | 2013-09-30 | T2K | [SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors. |
| 97479 | Red | Very Urgent | On Hold | 2013-09-20 | 2013-09-30 | Atlas | RAL-LCG2, high job failure rate |
| 97385 | Red | Less Urgent | On Hold | 2013-09-17 | 2013-09-26 | HyperK | CVMFS for hyperk.org |
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname |
| 95996 | Red | Urgent | On Hold | 2013-07-22 | 2013-09-17 | OPS | SHA-2 test failing on lcgce01 |
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support |
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host |
| Daily availability (%) by VO.
|
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
| 18/09/13 | 97.2 | 100 | 90.6 | 93.2 | 93.0 | Castor stop for DB restart caused test failures across all VOs (except Alice); Atlas also had a couple of other SRM SUM test failures. |
| 19/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 20/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 21/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 22/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 23/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 24/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 25/09/13 | 100 | 100 | 98.1 | 100 | 100 | Failed one SUM test during an SRM upgrade, then another (Delete) failure overnight. |
| 26/09/13 | 100 | 100 | 100 | 95.8 | 100 | Single SRM test failure as SRMs were updated. |
| 27/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 28/09/13 | 100 | 95.8 | 100 | 100 | 100 | Batch problem (cannot connect to batch server). |
| 29/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 30/09/13 | 100 | 100 | 100 | 95.7 | 100 | Single SRM test failure on PUT ("Error reading token data header"). |
| 01/10/13 | 100 | 100 | 74.8 | 95.8 | 95.8 | Atlas problem affected all sites; single test failure for LHCb during the Janet6 transition; single test failure for CMS (timeout). |