RAL Tier1 weekly operations castor 22/03/2019
Revision as of 10:25, 22 March 2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
   1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- New facd0t1 disk servers
  - All new facd0t1 disk servers are in production
  - We will then retire the old servers
- Facilities headnodes requested on VMware; ticket not done yet.
  - Willing to accept delays on this until ~May.
  - Queued behind the new disk, the tape robot and a number of Diamond ICAT tasks.
- Acceptance testing of the new tape robot completed.
  - New-style tape server installation ongoing.
  - Tape library ready for CASTOR-side testing.
- Aquilon disk servers ready to go, also queued behind the tape robot.
 
Operation problems
- ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
  - Rob has created a ticket with Tim.
- CASTOR metric reporting for GridPP.
  - Looking for clarity on precisely which metrics are relevant and, given CASTOR's changed role, which system RA should report on.
- lcgsrm10.gridpp.rl.ac.uk (LHCb) failed and was dropped out of the alias. It will (probably) not be fixed.
- castor-stager01.gridpp.rl.ac.uk went read-only on Tuesday evening due to a hypervisor load issue. According to Fabric this is a known issue.
  - A mitigation measure has been put in place (turning a high-load box on the same hypervisor into a physical host).
 
 
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
  - CASTOR team to generate a list of nonstandard settings and consider whether they are justified (a possible starting point is sketched after this list).
- New Facilities disk servers.
- Tape robot testing.
 
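A possible starting point for the nonstandard-settings list, sketched in Python under the assumption that each pool's settings have already been exported to simple "key = value" dump files (the settings/<pool>.conf layout below is illustrative, not the real tooling): for every setting the most common value across pools is treated as the standard and anything that deviates is reported.

  #!/usr/bin/env python
  # Sketch only: flag pool settings that differ from the most common value.
  # Assumes each pool's settings were dumped to settings/<pool>.conf as
  # "key = value" lines; that dump format/location is an assumption.
  import collections
  import glob
  import os

  settings = {}  # pool name -> {setting: value}
  for path in glob.glob("settings/*.conf"):      # hypothetical dump location
      pool = os.path.splitext(os.path.basename(path))[0]
      settings[pool] = {}
      with open(path) as f:
          for line in f:
              if "=" in line and not line.lstrip().startswith("#"):
                  key, value = [s.strip() for s in line.split("=", 1)]
                  settings[pool][key] = value

  # Treat the most common value of each setting as "standard"; report the rest.
  all_keys = set(k for conf in settings.values() for k in conf)
  for key in sorted(all_keys):
      counts = collections.Counter(conf.get(key) for conf in settings.values())
      standard = counts.most_common(1)[0][0]
      for pool, conf in sorted(settings.items()):
          if conf.get(key) != standard:
              print("%-20s %-35s %s (standard: %s)" % (pool, key, conf.get(key), standard))
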
Long-term projects
- New CASTOR WLCGTape instance.
  - LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
- CASTOR disk server migration to Aquilon.
  - Change ready to implement.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
  - Ticket with the Fabric team to make the VMs.
- RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1.
- Bellona (new Facilities DB) migration - monitoring fixed.
 
Actions
- AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup/deletion of empty directories.
  - Some discussion about what exactly is required and how this can actually be implemented.
  - CASTOR team proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever any write is attempted (see the sketch after this list).
- RA to look at making all fileclasses have nbcopies >= 1.
- Problem with the functional test node using a personal proxy which runs out some time in July (a simple expiry check is sketched below).
  - Rob met with Jens and requested an appropriate certificate.
 
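A minimal sketch of how the fileclass switch could be scripted, assuming the standard CASTOR name-server client (nschclass, taking the class name followed by the path) and a fileclass defined with one required tape copy but no migration route; the fileclass name 'tape-no-migration' and the directory-list file are hypothetical placeholders, not the agreed implementation.

  #!/usr/bin/env python
  # Sketch only: move the old d1t0 directories onto a write-blocking fileclass.
  # Assumes a fileclass 'tape-no-migration' (hypothetical name) already exists
  # with nbcopies=1 and no migration route, so writes under these directories
  # fail as described in the proposal above.
  import subprocess

  BLOCK_CLASS = "tape-no-migration"        # hypothetical fileclass name
  DIR_LIST = "old-d1t0-dirs.txt"           # hypothetical list of namespace paths

  with open(DIR_LIST) as f:
      for line in f:
          path = line.strip()
          if not path or path.startswith("#"):
              continue
          # nschclass <classname> <path>: change the fileclass on a namespace
          # entry; new files created below the directory inherit its class.
          rc = subprocess.call(["nschclass", BLOCK_CLASS, path])
          print("%-4s %s" % ("OK" if rc == 0 else "FAIL", path))
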
 
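For the personal-proxy issue, a small check like the one below could be run from cron to warn well before the credential expires; the certificate path is a hypothetical placeholder and the check simply reads the end date with openssl.

  #!/usr/bin/env python
  # Sketch only: warn when the credential used by the functional tests is
  # close to expiry. The path is a placeholder for whatever proxy/certificate
  # the test node actually uses.
  import datetime
  import subprocess
  import sys

  CERT = "/etc/grid-security/functional-test-cert.pem"   # hypothetical path
  WARN_DAYS = 30

  out = subprocess.check_output(
      ["openssl", "x509", "-in", CERT, "-noout", "-enddate"]).decode()
  # openssl prints e.g. "notAfter=Jul 14 09:00:00 2019 GMT"
  not_after = out.strip().split("=", 1)[1].replace(" GMT", "")
  expires = datetime.datetime.strptime(not_after, "%b %d %H:%M:%S %Y")
  days_left = (expires - datetime.datetime.utcnow()).days
  if days_left < WARN_DAYS:
      print("WARNING: %s expires in %d days (%s)" % (CERT, days_left, expires))
      sys.exit(1)
  print("OK: %s valid for another %d days" % (CERT, days_left))
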
Staffing
- RA out from Friday for two weeks.
- GP out on Monday.
 
AoB
On Call
GP on call