RAL Tier1 Operations Report for 3rd April 2013
Review of Issues during the fortnight 20th March to 3rd April 2013.
- On Thursday morning, 21st March, a networking problem between around 08:40 and 09:00 caused some transitory problems for the Tier1. The effect was seen as a single SUM test failure for Atlas.
- Overnight Monday/Tuesday 25/26 March one of the three top-level BDII nodes failed. It was removed from the alias the following morning.
- On Tuesday 26th March, high CPU usage on the site firewall caused intermittent network problems affecting the Tier1 from around 08:00 to 09:45. Services again experienced transitory failures, including FTS and SAM test failures.
- On Thursday afternoon, around 16:00, an operational error led to a networking break that caused some transitory problems for the Tier1.
- Services ran well over the Easter weekend. There were a couple of problems that did not affect front-line services, although one of the perfSONAR network monitoring systems went down.
Resolved Disk Server Issues
- GDSS446 (AtlasDataDisk D1T0) was taken out of service after reporting FSProbe errors yesterday evening (2nd April). A disk drive in the server has been replaced and it was returned to service around 12:25 today (3rd April).
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check for this regularly and take action to minimise its effects (an illustrative sketch of such a check is shown after this list).
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains. A different version of CVMFS has been installed as a test and investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
- There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
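The start-rate check script itself is not described in this report; the following is a minimal sketch of the kind of periodic check referred to in the list above. Everything in it is an assumption for illustration only: the accounting-log path, the log format, and the remedial action are hypothetical placeholders, not the actual RAL implementation.

```python
#!/usr/bin/env python
# Hypothetical sketch of a periodic batch start-rate check.
# NOT the actual RAL script: the log path, log format and remedial
# action below are placeholders for illustration only.

import subprocess
import sys
import time

CHECK_WINDOW = 30 * 60   # look back over the last 30 minutes
MIN_STARTS = 1           # act if fewer than this many jobs started

def count_recent_starts(window_seconds, log_path="/var/spool/batch/accounting.log"):
    """Count jobs whose start time (assumed to be a Unix timestamp at the
    start of each accounting-log line) falls within the last window_seconds."""
    now = time.time()
    count = 0
    with open(log_path) as log:
        for line in log:
            try:
                started = float(line.split()[0])
            except (ValueError, IndexError):
                continue
            if now - started <= window_seconds:
                count += 1
    return count

def remedial_action():
    """Placeholder for a site-specific action (e.g. restarting or nudging
    the batch scheduler); here it just records the event via syslog."""
    subprocess.call(["logger", "batch start-rate check: no recent job starts"])

if __name__ == "__main__":
    if count_recent_starts(CHECK_WINDOW) < MIN_STARTS:
        remedial_action()
        sys.exit(1)   # non-zero exit so a cron or Nagios wrapper can alert
    sys.exit(0)
```

In practice a check like this would be run from cron every few minutes, with the non-zero exit code feeding into the existing monitoring.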
Ongoing Disk Server Issues
Notable Changes made this last week
- Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
- Kernel/errata updates and removal of the AFS software (as opposed to just disabling it) are being done across the worker nodes.
- Removal of the EMI-1 WMS systems (WMS01, 02, 03). (Disabled in the GOC DB.)
- This evening (Wednesday 3rd April, 18:00 - 23:59 BST) there is emergency maintenance in Geneva affecting both the main and backup links to CERN. No outage is expected during this maintenance; services are considered at risk only.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Next Tuesday (9th April) morning a networking update will cause a couple of short breaks in external connectivity as switches are rebooted.
- One of the disk arrays hosting the LFC/FTS/3D databases has given some errors, and an intervention requiring a stop of these services will be necessary. We are checking how long this will take before making an announcement.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Single link to the UKLight Router to be restored as a paired (2*10Gbit) link.
  - Update the core Tier1 network and change the connection to the site and OPN, including:
    - Install a new routing layer for the Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
  - Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check (will require some downtime).
Entries in GOC DB starting between 20th March and 3rd April 2013.
There were no unscheduled entries in the GOC DB for the last fortnight.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms01, lcgwms02, lcgwms03 | SCHEDULED | OUTAGE | 22-03-2013 11:00 | 15-04-2013 12:00 | 24 days | EMI-1 WMS service retirement
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-28 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-04-03 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Daily availabilities (%):
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/03/13 | 100 | 100 | 100 | 100 | 100 |
21/03/13 | 100 | 100 | 98.7 | 100 | 100 | Single SRM test failure at time of network problem.
22/03/13 | 100 | 100 | 100 | 100 | 100 |
23/03/13 | 100 | 100 | 100 | 100 | 100 |
24/03/13 | 100 | 100 | 100 | 100 | 100 |
25/03/13 | 100 | 100 | 100 | 100 | 100 |
26/03/13 | 100 | 100 | 94.2 | 100 | 100 | Network problem triggered by site firewall overload.
27/03/13 | 100 | 100 | 100 | 100 | 100 |
28/03/13 | 100 | 100 | 100 | 100 | 100 |
29/03/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM Put failure (user timeout).
30/03/13 | 100 | 100 | 100 | 100 | 100 |
31/03/13 | 100 | 100 | 98.1 | 100 | 100 | Two consecutive failures of SRM Put test. "could not open connection to srm-atlas.gridpp.rl.ac.uk"
01/04/13 | 100 | 100 | 100 | 100 | 100 |
02/04/13 | 100 | 100 | 99.1 | 100 | 100 | Single SUM test failure of SRM delete.