Difference between revisions of "RAL Tier1 weekly operations castor 05/04/2019"
From GridPP Wiki
								
												
Latest revision as of 09:54, 5 April 2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
   1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Old facd0tl disk servers have been decommissioned
  - Proved that Facilities can recall a zero-sized file
- Facilities headnodes requested on VMware, ticket not done yet. Facilities VMware cluster still under construction
  - Willing to accept delays on this until ~May
  - Queued behind tape robot and a number of Diamond ICAT tasks
- Acceptance testing of the new tape robot completed
  - New-style tape server installation ongoing.
  - Tape library for CASTOR-side testing in progress now
    - Large-scale read testing completed and appears successful; analysis of a few outstanding queries is under way
- Aquilon disk servers ready to go, also queued behind tape robot
 
Operation problems
- TimF was doing a tape verify for Diamond and encountered issues with all files on the tape being verified (fd1866)
  - We would like the test not to run on the read-only Diamond tape server, but this is not possible on the Facilities instance
  - Not going to investigate further unless the problem recurs
- ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
  - Tim A has updated the ticket, indicating he will raise the issue with the appropriate people
- CASTOR metric reporting for GridPP
  - Looking for clarity on precisely which metrics are relevant and, given CASTOR's changed role, which system RA should report on.
- LHCb are trying to access files using the wrong service class (default)
- I/O data traffic on gdss700 stalls; under investigation
- CEDA outbound certificates are going to expire, but Rob H is on the case
- SQL error during an nsls on the Facilities instance
  - This shows up as an error of 'no file on CASTOR'
  - No issues seen on Hermes; was there any issue with CASTOR?
 
 
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
  - CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
- Continue CASTOR-side tape robot testing.
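
One possible way to produce the list of nonstandard pool settings is sketched below. It is a minimal illustration only: the pool names, setting names, and values are hypothetical, the settings are assumed to have already been exported into a Python dictionary, and "standard" is assumed to mean the most common value of each setting across pools. None of this reflects an actual CASTOR interface.

```python
from collections import Counter


def nonstandard_settings(pools):
    """Return {pool: {setting: (value, baseline)}} for every value that
    differs from the most common value of that setting across all pools."""
    # Assumption: the "standard" value of a setting is its most common value.
    keys = {key for settings in pools.values() for key in settings}
    baseline = {}
    for key in sorted(keys):
        values = [s[key] for s in pools.values() if key in s]
        baseline[key] = Counter(values).most_common(1)[0][0]

    report = {}
    for pool, settings in pools.items():
        diffs = {k: (v, baseline[k]) for k, v in settings.items() if v != baseline[k]}
        if diffs:
            report[pool] = diffs
    return report


# Hypothetical pool names and settings, for illustration only.
example_pools = {
    "poolA": {"gc_policy": "default", "max_replicas": 1},
    "poolB": {"gc_policy": "default", "max_replicas": 1},
    "poolC": {"gc_policy": "custom", "max_replicas": 1},
}
print(nonstandard_settings(example_pools))
```

Each entry in the report pairs the pool's value with the assumed baseline, which gives the team a starting point for deciding whether the deviation is justified.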
 
Long-term projects
- New CASTOR WLCGTape instance.
  - LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo complete.
- CASTOR disk server migration to Aquilon.
  - Change ready to implement.
  - More meaningful stress test needs to be carried out.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
  - Ticket with Fabric team to make the VMs.
- RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1
 
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
  - Some discussion about what exactly is required and how this can actually be implemented.
  - CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
- RA to look at making all fileclasses have nbcopies >= 1.
- Problem with functional test node using a personal proxy which runs out some time in July.
  - Rob met with Jens and requested an appropriate certificate.
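
The nbcopies action above amounts to a simple filter over the fileclass list. The sketch below assumes the fileclass names and their nbcopies values have already been extracted into a dictionary; the names and values shown are hypothetical and no real CASTOR command or schema is implied.

```python
def fileclasses_without_copies(fileclasses):
    """Return the sorted names of file classes whose nbcopies is below 1."""
    return sorted(name for name, nbcopies in fileclasses.items() if nbcopies < 1)


# Hypothetical fileclass names and nbcopies values, for illustration only.
example_fileclasses = {"classA": 1, "classB": 0, "classC": 2}
print(fileclasses_without_copies(example_fileclasses))
```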
 
 
Staffing
- RA back next week
 
AoB
On Call
RA on call