RAL Tier1 Operations Report for 23rd December 2015
| Review of Issues during the fortnight 9th to 23rd December 2015.
|
- On Thursday 10th December there was a significant problem on the Tier1 network. A packet storm was followed by the Tier1 network being disconnected from the site network. The trigger appears to have been the restarting of a particular switch. The details of this are not yet understood.
- There was a problem with the recall from tape of a large number of files for LHCb over the weekend of 11th to 13th Dec. This was caused by poor performance of at least one of the disk servers in the disk cache, combined with a parameter introduced in the Castor 2.1.15 tape servers that delayed the reporting of when files had been read from tape.
| Resolved Disk Server Issues
|
- GDSS675 (CMSTape - D0T1) failed in the early hours of 8th Dec. On investigation it was found that two disks had failed. Returned to production on 11th December.
- GDSS620 (GenTape - D0T1) also failed during the early morning of 8th Dec. No cause was found and it was returned to production on the 11th Dec.
- GDSS689 (AtlasDataDisk - D1T0) was taken out of production on the 10th Dec. when a second disk failed during the rebuild of the first. Returned to production on Tuesday 15th Dec. after the first disk rebuild had completed.
- GDSS654 (LHCbRawRDst - D0T1) was taken out of service on the 11th Dec., again as a precaution following a double disk failure. Returned to production on the 13th Dec.
- GDSS617 (AliceDisk - D1T0) crashed on the 15th Dec. After testing, no underlying cause was found. Returned to production on the 17th Dec.
- GDSS710 (CMSDisk - D1T0) was taken out of production for a short while for a reboot as the RAID card was not detecting a replacement disk.
- GDSS686 (AtlasDataDisk - D1T0) was taken out of production on the 21st Dec. as there was a double disk failure. Returned to production on the 22nd after the first disk had rebuilt.
| Current operational status and issues
|
- There is a problem seen by LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise, we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
- There is a problem reported by LHCb of a high rate of batch job failures since around the 9th December. The cause is not yet known.
| Ongoing Disk Server Issues
|
- GDSS665 (AtlasTape) failed on the 21st Dec. It was rebooted and all canbemigr files were migrated to tape. It is undergoing tests.
- GDSS620 (GenTape - D0T1) failed again (see above) on the 22nd Dec. It was rebooted and all canbemigr files were migrated to tape. It is undergoing tests.
- GDSS656 (LHCbRawRDst - D0T1) had a double disk failure on the 23rd Dec. It has been removed from service while the disks are replaced and the RAID array rebuilt.
| Notable Changes made since the last meeting.
|
- The final steps have been taken in the removal of the old core network switch, which has now been taken off the network.
- A board has been replaced in the UKLight router and another added. Following this, the link between the UKLight router and the RAL border router was doubled from a single 10Gbit connection to a pair of 10Gbit connections, doubling our data bandwidth over this route.
- In order to ease problems with tape recalls, servers in the LHCbRawRDst service class were converted to use the Linux NOOP IO scheduler (a minimal sketch of this kind of change is given after this list).
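As context for the scheduler change above: on Linux the IO scheduler is selected per block device, and the NOOP elevator passes requests through in near-FIFO order, which can help when a hardware RAID controller already does its own reordering. The sketch below shows one way such a change could be applied at runtime via sysfs; it is a minimal illustration only, and the device names are hypothetical rather than a record of what was actually done on the Tier1 disk servers (which might equally set the scheduler via kernel boot parameters or udev rules).

    #!/usr/bin/env python3
    # Minimal sketch (requires root on a Linux host): switch the listed block
    # devices to the 'noop' IO scheduler via sysfs. The device names below are
    # hypothetical examples, not the actual Tier1 disk server layout.
    import pathlib

    DEVICES = ["sdb", "sdc"]  # assumed data-disk devices; adjust per host

    for dev in DEVICES:
        sched = pathlib.Path(f"/sys/block/{dev}/queue/scheduler")
        # The currently active scheduler is shown in brackets, e.g. "noop [cfq] deadline"
        print(f"{dev}: before -> {sched.read_text().strip()}")
        sched.write_text("noop")  # select the noop elevator for this device
        print(f"{dev}: after  -> {sched.read_text().strip()}")

A change made this way takes effect immediately but does not survive a reboot, so a persistent setting would normally also be made in the boot configuration.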
| Declared in the GOC DB
|
None
| Advanced warning for other interventions
|
| The following items are being discussed and are still to be formally scheduled and announced.
|
- Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update disk servers in tape-backed service classes to SL6 (ongoing).
  - Update to Castor version 2.1.15.
- Networking:
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, LFC).
| Entries in GOC DB starting since the last report.
|
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| All Castor (All SRMs) | UNSCHEDULED | WARNING | 15/12/2015 08:00 | 15/12/2015 09:00 | 1 hour | Warning on data transfers to/from site during board swap in network router. |
| Whole Site | UNSCHEDULED | OUTAGE | 10/12/2015 11:45 | 10/12/2015 15:15 | 3 hours and 30 minutes | Internal network issues at RAL Tier1 (retrospective). |
| All Castor (All SRMs) | SCHEDULED | WARNING | 09/12/2015 09:30 | 09/12/2015 10:30 | 1 hour | Warning on Castor services for short network reconfiguration. |
| Open GGUS Tickets (Snapshot during morning of meeting)
|
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
| 118345 | Green | Very Urgent | Waiting for Reply | 2015-12-14 | 2015-12-22 | | Usage of your WMS |
| 118209 | Green | Less Urgent | In Progress | 2015-12-15 | 2015-12-18 | | Enabling CVMFS for the vo.neugrid.eu VO |
| 118044 | Green | Less Urgent | In Progress | 2015-11-30 | 2015-12-16 | Atlas | gLExec hammercloud jobs failing at RAL-LCG2 since October |
| 117846 | Green | Urgent | Waiting for Reply | 2015-11-23 | 2015-12-22 | Atlas | ATLAS request - storage consistency checks |
| 117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2015-11-19 | | CASTOR at RAL not publishing GLUE 2 |
| 116866 | Amber | Less Urgent | On Hold | 2015-10-12 | 2015-12-18 | SNO+ | snoplus support at RAL-LCG2 (pilot role) |
| 116864 | Red | Urgent | In Progress | 2015-10-12 | 2015-12-16 | CMS | T1_UK_RAL AAA opening and reading test failing again... |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
| Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
| 09/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 10/12/15 | 98.5 | 100 | 97 | 92 | 85 | 93 | 100 | Problems on Tier1 network led to test failures. |
| 11/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 12/12/15 | 100 | 100 | 100 | 100 | 96 | 100 | 100 | Single SRM test failure as the test was not able to copy a file in. |
| 13/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 14/12/15 | 100 | 100 | 98 | 98 | 100 | 100 | 96 | Atlas: Single SRM test failure (timeout) on GET. CMS: The org.sam.CONDOR-JobSubmit test on all ARC-CEs: "Unspecified gridmanager error". |
| 15/12/15 | 100 | 100 | 100 | 99 | 100 | 100 | 100 | Continuation of above. |
| 16/12/15 | 100 | 100 | 98 | 96 | 100 | 98 | 100 | Atlas: Single SRM test failure on PUT (timeout); CMS: Single SRM test failure "File was NOT copied to SRM". |
| 17/12/15 | 100 | 100 | 100 | 96 | 100 | 98 | 100 | Single SRM test failure on PUT: Input/output error. |
| 18/12/15 | 100 | 100 | 100 | 100 | 100 | 96 | 95 | |
| 19/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 20/12/15 | 100 | 100 | 100 | 100 | 96 | 100 | 100 | Single SRM test failure on list: [SRM_INVALID_PATH] No such file or directory. |
| 21/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 22/12/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |