RAL Tier1 Operations Report for 3rd September 2014
| Review of Issues during the week 27th August to 3rd September 2014.
|
- A network switch failed overnight Friday-Saturday (29/30 Aug). Staff attended on site and the immediate problem was resolved. However, a number of VMs providing services were then found to have further problems, which took some time to fix. Not all services were affected, but the site (except Castor) was declared down for around five and a half hours on Saturday.
| Resolved Disk Server Issues
|
- GDSS748 (AtlasDataDisk - D1T0) was found to be unresponsive in the early morning of Thursday (28th Aug). It failed to restart after a reboot. A failed disk was found and replaced. The system was returned to service later that day.
| Current operational status and issues
|
- Discrepancies were found in some of the Castor database tables and columns. The Castor team are considering options with regard to fixing these. The issue has no operational impact.
- We are still investigating xroot access to CMS Castor following the upgrade on the 17th June. The service has improved but there may still be work to be done.
| Ongoing Disk Server Issues
|
- GDSS659 (AtlasDataDisk - D1T0) has had a number of problems. The server initially failed on Thursday (28th Aug). It was returned to service the following day but failed again over the weekend, a problem only found on Monday morning. Following a further RAID disk rebuild it was returned to service yesterday morning (Tuesday 2nd Sept). The server again stopped serving files at around 05:30 this morning. The server is now being drained (during which time it does serve files).
| Notable Changes made this last week.
|
- The FTS2 service was ended yesterday, 2nd September. The servers have been shut down.
- The Software Server that was used by the smaller VOs has been stopped.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/09/2014 11:00 | 02/10/2014 11:00 | 30 days | Service being decommissioned. |
| Advanced warning for other interventions
|
| The following items are being discussed and are still to be formally scheduled and announced.
|
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- Access to the Cream CEs will be withdrawn apart from leaving access for ALICE. The proposed date for this is Tuesday 23rd September.
Listing by category:
- Databases:
- Apply latest Oracle patches (PSU) to the production database systems (Castor, Atlas3D).
- Switch LFC/3D to new Database Infrastructure.
- Castor:
- Networking:
- Move switches connecting the 2011 disk servers batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric
- Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes; migration of GEN from 'A' to 'D' tapes.)
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room - date to be decided.
| Entries in GOC DB starting between the 27th August and 3rd September 2014.
|
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| cream-ce01.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 02/09/2014 15:01 | 02/09/2014 16:16 | 1 hour and 15 minutes | draining before re-configuration |
| lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/09/2014 11:00 | 02/10/2014 11:00 | 30 days | Service being decommissioned. |
| All services except Castor | UNSCHEDULED | WARNING | 30/08/2014 14:00 | 01/09/2014 09:43 | 1 day, 19 hours and 43 minutes | WARNING following network problems on virtual machine |
| All services except Castor | UNSCHEDULED | OUTAGE | 30/08/2014 09:00 | 30/08/2014 14:24 | 5 hours and 24 minutes | Putting all services except CASTOR into downtime while we investigate network related problems on the HyperV systems |
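The durations listed for these entries follow directly from the start and end timestamps. As a minimal sketch of that arithmetic, assuming the timestamps are in day/month/year format as shown in the entries, the following Python computes the elapsed days, hours and minutes. The function name `downtime_duration` is hypothetical and is not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as in the GOC DB entries.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> tuple[int, int, int]:
    """Return the elapsed time between start and end as (days, hours, minutes)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# Start and end times copied from the entries above.
entries = [
    ("cream-ce01 outage", "02/09/2014 15:01", "02/09/2014 16:16"),
    ("lcgfts outage", "02/09/2014 11:00", "02/10/2014 11:00"),
    ("site warning", "30/08/2014 14:00", "01/09/2014 09:43"),
    ("site outage", "30/08/2014 09:00", "30/08/2014 14:24"),
]

for label, start, end in entries:
    d, h, m = downtime_duration(start, end)
    print(f"{label}: {d} days, {h} hours, {m} minutes")
```

Running the loop reproduces the Duration values recorded for each entry, which is a quick way to check a hand-entered downtime.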
| Open GGUS Tickets (Snapshot during morning of meeting)
|
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
| 107935 | Green | Less Urgent | In Progress | 2014-08-27 | 2014-09-02 | Atlas | BDII vs SRM inconsistent storage capacity numbers |
| 107880 | Green | Less Urgent | In Progress | 2014-08-26 | 2014-09-02 | SNO+ | srmcp failure |
| 106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-08-14 | CMS | pilots losing network connections at T1_UK_RAL |
| 105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-29 | | Please check your Vidyo router firewall configuration |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
| Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
| 27/08/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
| 28/08/14 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | |
| 29/08/14 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | |
| 30/08/14 | 100 | 100 | 99.4 | 94.3 | 100 | 100 | 98 | A network switch failed. This was worked around but the VM infrastructure exhibited some network problems too. |
| 31/08/14 | 100 | 100 | 100 | 100 | 100 | 94 | n/a | |
| 01/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | |
| 02/09/14 | 100 | 100 | 100 | 100 | 100 | 96 | 96 | |
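For a quick weekly summary of the daily figures above, the columns that dipped below 100 can be averaged. This is only an illustrative sketch: the values are copied from the daily availability figures, the "n/a" entry is skipped rather than counted as zero (an assumption, not a stated policy), and the helper name `weekly_mean` is hypothetical.

```python
from statistics import mean

# Daily availability (%) for 27/08/14 to 02/09/14, copied from the figures above.
# None marks the "n/a" entry for CMS HC on 31/08/14.
availability = {
    "Atlas": [100, 100, 100, 99.4, 100, 100, 100],
    "CMS": [100, 100, 100, 94.3, 100, 100, 100],
    "Atlas HC": [98, 100, 100, 100, 94, 100, 96],
    "CMS HC": [100, 97, 97, 98, None, 96, 96],
}


def weekly_mean(values):
    """Average the available daily values, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return round(mean(present), 1)


for column, values in availability.items():
    print(f"{column}: {weekly_mean(values)}")
```

Skipping missing days keeps a single "n/a" from dragging the average down; a stricter report might instead flag the week as incomplete.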