Latest revision as of 13:24, 22 August 2018
RAL Tier1 Operations Report for 20th August 2018
| Review of Issues during the week 13th August to the 20th August 2018. |
- The upgrade of Echo was completed successfully on Thursday (16/8/18), with a greatly reduced memory usage. The cluster was allowed to recover overnight. Everything appeared to be working well on Friday (17/8/18), and there is currently no evidence of data loss. We therefore ended the downtime at 12:00 UTC on Friday (17/8/18). As a precaution for the weekend, we limited the ATLAS (and CMS) quota on our batch farm to 50% of its nominal amount. Assuming we encounter no problems, we intend to lift this on Monday (20/8/18).
| Current operational status and issues |
- The new O2 SIM for the SMS service has been delivered and installed.
| Resolved Castor Disk Server Issues |
| Machine | VO | DiskPool | dxtx | Comments |
|---|---|---|---|---|
| - | - | - | - | - |
| Ongoing Castor Disk Server Issues |
| Machine | VO | DiskPool | dxtx | Comments |
|---|---|---|---|---|
| gdss747 | Atlas | atlasStripInput | d1t0 | Currently in intervention. |
| Limits on concurrent batch system jobs. |
- GROUP_CMS_LIMIT = 4000
- GROUP_ATLAS_LIMIT = 8000
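The macro names above follow the HTCondor concurrency-limit convention (`<NAME>_LIMIT` in the pool configuration, matched by a `concurrency_limits` request in the submit file). A minimal sketch of how these caps would be expressed, assuming that mechanism is what the batch farm uses (the report only gives the names and values):

```
# Pool configuration fragment (illustrative; assumes HTCondor concurrency limits)
GROUP_CMS_LIMIT   = 4000
GROUP_ATLAS_LIMIT = 8000

# A job counted against the CMS cap would request it in its submit file:
#   concurrency_limits = GROUP_CMS
```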
| Notable Changes made since the last meeting. |
- None.
| Entries in GOC DB starting since the last report. |
| Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - | - |
| Declared in the GOC DB |
| Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - | - |
- No ongoing downtime
- No downtime scheduled in the GOC DB for the next 2 weeks
| Advanced warning for other interventions |
| The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- Castor:
- Update systems to use SL7, configured by Quattor/Aquilon (tape servers done).
- Move to generic Castor headnodes.
- Internal
- DNS servers will be rolled out within the Tier1 network.
| Open GGUS Tickets (Snapshot taken during morning of the meeting). |
| Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
|---|---|---|---|---|---|---|---|---|
| 136757 | mice | in progress | less urgent | 17/08/2018 | 21/08/2018 | Other | Missing lsc files for mice VO on lfc.gridpp.rl.ac.uk ? | EGI |
| 136701 | lhcb | in progress | very urgent | 14/08/2018 | 21/08/2018 | File Transfer | background of transfer errors | WLCG |
| 136366 | mice | in progress | less urgent | 25/07/2018 | 20/08/2018 | Local Batch System | Remove MICE Queue from RAL T1 Batch | EGI |
| 136199 | lhcb | in progress | very urgent | 18/07/2018 | 07/08/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG |
| 136028 | cms | in progress | top priority | 10/07/2018 | 21/08/2018 | CMS_AAA WAN Access | Issues reading files at T1_UK_RAL_Disk | WLCG |
| 124876 | ops | in progress | less urgent | 07/11/2016 | 23/07/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI |
| GGUS Tickets Closed Last week |
| Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
|---|---|---|---|---|---|---|---|---|
| 136655 | lhcb | verified | less urgent | 10/08/2018 | 15/08/2018 | File Access | Missing File At RAL | WLCG |
| 136460 | cms | closed | urgent | 30/07/2018 | 15/08/2018 | CMS_Data Transfers | Transfers failing to RAL_Buffer | WLCG |
| 136427 | atlas | closed | urgent | 28/07/2018 | 13/08/2018 | File Transfer | UK RAL-LCG2: Transfer errors as destination | WLCG |
| 136408 | cms | closed | urgent | 27/07/2018 | 15/08/2018 | CMS_Data Transfers | missing files at RAL | WLCG |
| Availability Report |
| Target Availability for each site is 97.0% | Red <90% | Orange <97% |
| Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
|---|---|---|---|---|---|---|---|
| 2018-08-13 | 100 | 100 | 0 | 100 | 100 | 100 | |
| 2018-08-14 | 100 | 100 | 0 | 100 | 100 | 100 | |
| 2018-08-15 | 100 | 100 | 0 | 100 | 100 | 100 | |
| 2018-08-16 | 100 | 100 | 0 | 100 | 100 | 100 | |
| 2018-08-17 | 100 | 100 | 60 | 100 | 100 | 100 | |
| 2018-08-18 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 2018-08-19 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 2018-08-20 | 100 | 100 | 100 | 100 | 100 | 100 |
| Hammercloud Test Report |
| Target Availability for each site is 97.0% | Red <90% | Orange <97% |
| Day | Atlas HC | CMS HC | Comment |
|---|---|---|---|
| 2018-08-13 | 0 | 0 | |
| 2018-08-14 | 0 | 0 | |
| 2018-08-15 | 0 | 0 | |
| 2018-08-16 | 0 | 0 | |
| 2018-08-17 | 76 | 60 | |
| 2018-08-18 | 100 | 100 | |
| 2018-08-19 | 100 | 100 | |
| 2018-08-20 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
| Notes from Meeting. |
- The recent problems with Echo were discussed. A summary is also in the Operations report. Several points:
- Current situation is that we have “production” access running.
- Atlas and CMS have upper limits on batch jobs approximately equal to their pledge.
- The additional memory for the Dell storage nodes has arrived. This will be added, at least initially, in a rolling upgrade. However, it is expected this may take up to two weeks, at which point we expect to be able to give full access to Echo. Discussions on the best way to do the memory upgrades are ongoing.
- Before the problem we had noted that the LHCb files came from Castor in a “bursty” way. It was suggested last week that we limit the FTS to smooth out this burstiness.
- When discussing GGUS tickets: LHCb are seeing some file transfer failures between worker nodes and Castor. This is to be escalated.
- The requirement for us to be able to easily contact all users of Echo in the event of a problem was noted.
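The FTS suggestion above amounts to smoothing a bursty arrival stream with a rate cap. A minimal Python sketch of the underlying token-bucket idea (illustrative only; FTS itself applies limits such as per-link active-transfer caps rather than this exact code):

```python
import time

class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to
    `capacity`. Illustrative sketch of the smoothing principle only."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A burst of 10 requests against a bucket allowing bursts of 3:
bucket = TokenBucket(rate=2, capacity=3)
results = [bucket.allow() for _ in range(10)]
```

With these settings the first three requests of a burst pass immediately and the remainder are deferred until tokens refill, which is the smoothing behaviour being proposed for the Castor-to-Echo transfers.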