RAL Tier1 Operations Report for 2nd October 2013
| Review of Issues during the fortnight 18th September to 2nd October 2013.
|
- As reported at the last meeting, there was a problem in the Oracle database behind Castor overnight on 17/18 Sep. This was a known problem and the fix was to change an Oracle parameter. A Castor outage, with a batch stop/pause, was carried out on Wednesday 18th in order to pick up the changed parameter.
- On Thursday (19th) there was a problem on some batch worker nodes with /tmp filling up and the nodes being set offline. This was traced to some ALICE jobs. Good communications with ALICE enabled a prompt resolution.
- On Saturday (28th Sep) a fault in a CVMFS repository at CERN, in particular /cvmfs/lhcb-conddb, caused the Condor farm's healthcheck to fail. Condor batch jobs did not start for a couple of hours, until this test was removed from the healthcheck (a sketch of a healthcheck of this kind, covering both the /tmp and CVMFS checks, follows this list).
- There have been some problems with the Torque/Maui batch server. This affected Alice test jobs on Saturday (28th). The problem was resolved on Tuesday (1st Oct) when some stuck batch jobs were cleaned up.
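The sketch below illustrates, in outline only, the sort of worker-node healthcheck referred to in the /tmp and CVMFS items above. The free-space threshold, the repository list and the way a failure is reported to the batch system are illustrative assumptions, not the production configuration.

  #!/usr/bin/env python
  # Illustrative worker-node healthcheck (not the production script).
  # A node is reported unhealthy if /tmp is nearly full or if a CVMFS
  # repository cannot be listed; the batch system's health-check hook
  # would then take the node offline.
  import os
  import sys

  TMP_MIN_FREE_MB = 1024                      # assumed threshold for /tmp
  CVMFS_PATHS = ["/cvmfs/lhcb-conddb"]        # repositories to probe (illustrative)

  def tmp_free_mb(path="/tmp"):
      """Free space (MB) on the filesystem holding 'path'."""
      st = os.statvfs(path)
      return (st.f_bavail * st.f_frsize) / (1024 * 1024)

  def cvmfs_ok(path):
      """Treat a repository as healthy if its root directory lists."""
      try:
          os.listdir(path)
          return True
      except OSError:
          return False

  def main():
      problems = []
      if tmp_free_mb() < TMP_MIN_FREE_MB:
          problems.append("/tmp nearly full")
      for path in CVMFS_PATHS:
          if not cvmfs_ok(path):
              problems.append("CVMFS repository %s unavailable" % path)
      if problems:
          print("UNHEALTHY: " + "; ".join(problems))
          sys.exit(1)
      print("HEALTHY")

  if __name__ == "__main__":
      main()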
| Resolved Disk Server Issues
|
- GDSS670 (AliceDisk - D1T0) failed on Sunday (22nd Sep). A RAID verify failed owing to a faulty disk. It was returned to service the next day.
- GDSS595 (GenTape - D0T1) was unavailable for a couple of hours on Monday (23rd Sep). The system needed to be restarted as it would not recognise a replacement disk drive.
- GDSS611 (LHCbDst - D1T0) failed on Tuesday (24th Sep). The disk controller was reporting many failed drives. The system was returned to service on Thursday (26th Sep) and has been drained for further investigation.
| Current operational status and issues
|
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- FTS3 testing with Atlas has continued very actively. These tests are uncovering problems in FTS3, and patches are being applied regularly to address them.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only act as a proxy (a sketch of the kind of read test involved follows this list).
- The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity and all of its WNs run SL6. The remaining nodes are in the Torque/Maui farm, whose WNs will be upgraded to SL6 tomorrow. That farm is expected to restart tomorrow with initially around 20% of the total batch capacity, with the remaining nodes added over the following days.
- Just before this meeting we had a problem that is believed to be in a network switch stack. We lost access to some older servers (including those in Atlas HotDisk) and some older worker nodes. Investigations are ongoing.
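As a footnote to the xrootd item above, the following is a minimal sketch of a federated-read check of the kind used in such tests: copy a file through a redirector and report success or failure. The redirector host and the file path are placeholders, not the real Atlas federation endpoints.

  #!/usr/bin/env python
  # Minimal federated-access read test: copy a file via an xrootd
  # redirector to /dev/null using xrdcp. Host and path are placeholders.
  import subprocess
  import sys

  REDIRECTOR = "xrootd-redirector.example.ac.uk"   # hypothetical redirector
  TEST_FILE = "/atlas/some/test/file.root"         # hypothetical file path

  def federated_read_ok(redirector, path):
      """Return True if the file can be read through the redirector."""
      url = "root://%s/%s" % (redirector, path)
      return subprocess.call(["xrdcp", "-f", url, "/dev/null"]) == 0

  if __name__ == "__main__":
      ok = federated_read_ok(REDIRECTOR, TEST_FILE)
      print("federated read %s" % ("OK" if ok else "FAILED"))
      sys.exit(0 if ok else 1)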
| Ongoing Disk Server Issues
|
- GDSS673 (CMSDisk - D1T0) is out of production. It failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was returned to service on Monday (30th Sep). However, it failed again when another disk failed while it was still rebuilding the RAID array.
| Notable Changes made this last fortnight.
|
- The SRMs have been upgraded to SL5.9 with updated errata and kernel (lcgsrm13 on Monday 23rd, the remaining Atlas SRMs on Wed. 25th, all others on Thursday 26th).
- FTS3 was upgraded to version 3.1.14-1 on Wed 25th and to version 3.1.16-1 the next day.
- On Thursday 26th the upgrade of all three Top-BDII nodes (lcgbdii01, lcgbdii03, lcgbdii04) to the latest EMI release (v3.8.0) was completed.
- The BDII component on a number of systems (including the ARC and CREAM CEs for the Condor farm) has been upgraded (a query sketch checking what is published is given after this list).
- Condor farm CEs set to Production in the GOC DB on Monday (30th). At this point the Condor farm had around 50% of total batch capacity.
- Tuesday 1st October: Update to Janet6 infrastructure for the RAL site connections and the backup OPN link to CERN.
- LCGCE01, 02, 04, 10 and 11 (the Torque/Maui farm) are in an Outage while their WNs are upgraded to SL6.
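Relating to the BDII items above, the sketch below shows one way of checking what the upgraded Top-BDII publishes for our CEs. It is illustrative only: it assumes GLUE 1.3 publishing and the python-ldap module, and the search filter is not the exact operational check.

  #!/usr/bin/env python
  # Query a Top-BDII for the CEs published under gridpp.rl.ac.uk and print
  # their state (assumes GLUE 1.3 schema and the python-ldap module).
  import ldap

  TOP_BDII = "ldap://lcgbdii01.gridpp.rl.ac.uk:2170"   # one of the upgraded nodes

  def list_ral_ces():
      conn = ldap.initialize(TOP_BDII)
      conn.simple_bind_s()                             # anonymous bind
      results = conn.search_s(
          "o=grid",
          ldap.SCOPE_SUBTREE,
          "(&(objectClass=GlueCE)(GlueCEUniqueID=*gridpp.rl.ac.uk*))",
          ["GlueCEUniqueID", "GlueCEStateStatus"],
      )
      for _dn, attrs in results:
          ce = attrs.get("GlueCEUniqueID", ["?"])[0]
          status = attrs.get("GlueCEStateStatus", ["?"])[0]
          print("%s : %s" % (ce, status))

  if __name__ == "__main__":
      list_ral_ces()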
| Advanced warning for other interventions
|
| The following items are being discussed and are still to be formally scheduled and announced.
|
- Tomorrow (Thursday 3rd October) - upgrade of the Torque/Maui batch farm WNs to SL6. Drain already underway.
- Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
- On Tuesday 8th October the primary RAL OPN link to CERN will migrate to the SuperJanet 6 infrastructure.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Grid Services:
  - Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
- Infrastructure:
  - A 2-day maintenance on the UPS, along with the safety testing of associated electrical circuits, is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this maintenance the following issues will be addressed:
    - Intervention required on the "Essential Power Board".
    - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    - Electrical safety check. This will take place over a couple of days, during which time individual UPS circuits will need to be powered down.
| Entries in GOC DB starting between the 18th September and 2nd October 2013.
|
There was one unscheduled outage in the GOC DB for this period, covering the stop of Castor (and batch) following a database problem.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| All Castor (all SRM end points) and all batch (All CEs) | UNSCHEDULED | OUTAGE | 18/09/2013 11:45 | 18/09/2013 14:45 | 3 hours | We are seeing errors in the database behind Castor. A Castor restart will be done to fix this. Will also pause batch jobs during this time. |
| lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/09/2013 13:00 | 11/09/2013 10:55 | 4 days, 21 hours and 55 minutes | Upgrade to EMI-3 |
| lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/09/2013 13:00 | 04/10/2013 13:00 | 29 days | CE (and the SL6 batch queue behind it) being decommissioned. |
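For reference, the downtime entries above can also be pulled from the GOC DB programmatic interface rather than read off the web pages. The sketch below is illustrative only: the startdate/enddate parameter names and the DOWNTIME element name are assumptions about the public PI and should be checked against its documentation.

  #!/usr/bin/env python
  # Fetch downtimes for RAL-LCG2 from the GOC DB programmatic interface
  # and print the fields of each entry. Parameter and element names are
  # assumptions; see the GOC DB PI documentation.
  import urllib.request
  import xml.etree.ElementTree as ET

  URL = ("https://goc.egi.eu/gocdbpi/public/"
         "?method=get_downtime&topentity=RAL-LCG2"
         "&startdate=2013-09-18&enddate=2013-10-02")   # assumed parameter names

  def print_downtimes(url=URL):
      with urllib.request.urlopen(url) as resp:
          root = ET.parse(resp).getroot()
      for downtime in root.iter("DOWNTIME"):
          fields = ["%s=%s" % (child.tag, (child.text or "").strip())
                    for child in downtime]
          print("; ".join(fields))

  if __name__ == "__main__":
      print_downtimes()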
| Open GGUS Tickets (Snapshot at time of meeting)
|
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
| 97516 | Red | Urgent | Waiting Reply | 2013-09-23 | 2013-09-30 | T2K | [SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors. |
| 97479 | Red | Very Urgent | On Hold | 2013-09-20 | 2013-09-30 | Atlas | RAL-LCG2, high job failure rate |
| 97385 | Red | Less Urgent | On Hold | 2013-09-17 | 2013-09-26 | HyperK | CVMFS for hyperk.org |
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname |
| 95996 | Red | Urgent | On Hold | 2013-07-22 | 2013-09-17 | OPS | SHA-2 test failing on lcgce01 |
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support |
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host |
| Daily availability (%) by VO.
|
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
| 18/09/13 | 97.2 | 100 | 90.6 | 93.2 | 93.0 | Castor stop for DB restart caused test failures across all VOs (except Alice); Atlas also had a couple of other SRM SUM test failures. |
| 19/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 20/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 21/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 22/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 23/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 24/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 25/09/13 | 100 | 100 | 98.1 | 100 | 100 | Failed one SUM test during an SRM upgrade, then another (Delete) failure overnight. |
| 26/09/13 | 100 | 100 | 100 | 95.8 | 100 | Single SRM test failure as SRMs were updated. |
| 27/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 28/09/13 | 100 | 95.8 | 100 | 100 | 100 | Batch problem (cannot connect to batch server). |
| 29/09/13 | 100 | 100 | 100 | 100 | 100 | |
| 30/09/13 | 100 | 100 | 100 | 95.7 | 100 | Single SRM test failure on PUT ("Error reading token data header"). |
| 01/10/13 | 100 | 100 | 74.8 | 95.8 | 95.8 | Atlas problem affected all sites; single test failure for LHCb during the Janet6 transition; single test failure for CMS (timeout). |