RAL Tier1 Operations Report for 3rd April 2013
Review of Issues during the fortnight 20th March to 3rd April 2013.
- On Thursday morning, 21st March, a networking problem between around 08:40 and 09:00 caused some transitory problems for the Tier1. The effect was seen as a single SUM test failure for Atlas.
- Overnight Monday/Tuesday 25/26 March one of the three top-level BDII nodes failed. It was removed from the alias the following morning.
- On Tuesday 26th March, high CPU usage on the site firewall caused intermittent network problems affecting the Tier1 from around 08:00 to 09:45. Services again experienced transitory failures, including FTS and SAM test failures.
- On Thursday afternoon, around 16:00, an operational error led to a networking break that caused some transitory problems for the Tier1.
- Services ran well over the Easter weekend. There were a couple of problems that did not affect front-line services, although one of the perfSONAR network monitoring systems went down.
Resolved Disk Server Issues
- GDSS446 (AtlasDataDisk D1T0) was taken out of service after reporting FSProbe errors yesterday evening (2nd April). A disk drive in the server has been replaced and it was returned to service around 12:25 today (3rd April).
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check for this regularly and take action to minimise its effects (an illustrative sketch of such a check is shown after this list).
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains. A different version of CVMFS has been installed as a test and investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
- There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
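The start-rate check script itself is not described in this report; the following is a minimal sketch of the kind of periodic check referred to in the list above. Everything in it is an assumption for illustration only: the accounting-log path, the log format, and the remedial action are hypothetical placeholders, not the actual RAL implementation.

```python
#!/usr/bin/env python
# Hypothetical sketch of a periodic batch start-rate check.
# NOT the actual RAL script: the log path, log format and remedial
# action below are placeholders for illustration only.

import subprocess
import sys
import time

CHECK_WINDOW = 30 * 60   # look back over the last 30 minutes
MIN_STARTS = 1           # act if fewer than this many jobs started

def count_recent_starts(window_seconds, log_path="/var/spool/batch/accounting.log"):
    """Count jobs whose start time (assumed to be a Unix timestamp at the
    start of each accounting-log line) falls within the last window_seconds."""
    now = time.time()
    count = 0
    with open(log_path) as log:
        for line in log:
            try:
                started = float(line.split()[0])
            except (ValueError, IndexError):
                continue
            if now - started <= window_seconds:
                count += 1
    return count

def remedial_action():
    """Placeholder for a site-specific action (e.g. restarting or nudging
    the batch scheduler); here it just records the event via syslog."""
    subprocess.call(["logger", "batch start-rate check: no recent job starts"])

if __name__ == "__main__":
    if count_recent_starts(CHECK_WINDOW) < MIN_STARTS:
        remedial_action()
        sys.exit(1)   # non-zero exit so a cron or Nagios wrapper can alert
    sys.exit(0)
```

In practice a check like this would be run from cron every few minutes, with the non-zero exit code feeding into the existing monitoring.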
Ongoing Disk Server Issues
Notable Changes made this last week
- Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
- Kernel/errata updates and removal of the AFS software (as opposed to just disabling it) are being done across the worker nodes.
- Removal of the EMI-1 WMS systems (WMS01, 02, 03). (Disabled in the GOC DB.)
- This evening (Wednesday 3rd April, 18:00 - 23:59 BST) there is emergency maintenance in Geneva affecting both the main and backup links to CERN. No outage is expected during this maintenance; services are considered at risk only.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Next Tuesday (9th April) morning a networking update will cause a couple of short breaks in external connectivity as switches are rebooted.
- One of the disk arrays hosting the LFC/FTS/3D databases has given some errors, and an intervention requiring a stop of these services will be necessary. We are checking how long this will take before making an announcement.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Single link to the UKLight Router to be restored as a paired (2*10Gbit) link.
  - Update the core Tier1 network and change the connection to the site and OPN, including:
    - Install a new routing layer for the Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
  - Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check (will require some downtime).
Entries in GOC DB starting between 20th March and 3rd April 2013.
There were no unscheduled entries in the GOC DB for the last fortnight.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms01, lcgwms02, lcgwms03 | SCHEDULED | OUTAGE | 22-03-2013 11:00 | 15-04-2013 12:00 | 24 days | EMI-1 WMS service retirement
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-28 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-04-03 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Daily availabilities (%):
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/03/13 | 100 | 100 | 100 | 100 | 100 |
21/03/13 | 100 | 100 | 98.7 | 100 | 100 | Single SRM test failure at time of network problem.
22/03/13 | 100 | 100 | 100 | 100 | 100 |
23/03/13 | 100 | 100 | 100 | 100 | 100 |
24/03/13 | 100 | 100 | 100 | 100 | 100 |
25/03/13 | 100 | 100 | 100 | 100 | 100 |
26/03/13 | 100 | 100 | 94.2 | 100 | 100 | Network problem triggered by site firewall overload.
27/03/13 | 100 | 100 | 100 | 100 | 100 |
28/03/13 | 100 | 100 | 100 | 100 | 100 |
29/03/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM Put failure (user timeout).
30/03/13 | 100 | 100 | 100 | 100 | 100 |
31/03/13 | 100 | 100 | 98.1 | 100 | 100 | Two consecutive failures of SRM Put test. "could not open connection to srm-atlas.gridpp.rl.ac.uk"
01/04/13 | 100 | 100 | 100 | 100 | 100 |
02/04/13 | 100 | 100 | 99.1 | 100 | 100 | Single SUM test failure of SRM delete.