RAL Tier1 weekly operations castor 13/05/2016
From GridPP Wiki
Operations News
- New MICE user set up
Operations Problems
- Aircon issue - impact was reduced by stopping the batch farm. A question was raised about how batch should be turned back on, given concerns that it could swamp CASTOR.
- tape library issues
- gfal investigations - awaiting membership of dteam etc for George P
- draining - George has training on manual draining technique (atlas)
- gfalcat does not work with CASTOR; the underlying issue has been fixed for gfalcopy but not for gfalcat (gfal developers responsible) - tracking
- AtlasScratch - ATLAS users are still having problems accessing atlasScratch files - investigations ongoing
- GDSS771 crashed - now in draining
- Draining is not working for ATLAS (it does, however, seem to work on LHCb). Brian has changed parameters as recommended by Shaun, with no improvement. The manual method of draining still works: diskServerLs and stager_get (to move a file to another disk server).
Planned, Scheduled and Cancelled Interventions
- host certs on many disk servers should be updated (gridftp relies on this)
- CASTOR 2.1.15 - issue with writes
Long-term projects
- SL7 CASTOR - disk servers are the higher priority (frontend to CEPH); this will be Aquilon based
- RA has produced a Python script to handle the SRM db duplication issue, which is causing callouts. This script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team, via AL.
- JJ – Glue 2 for CASTOR, used for publishing information. RA is writing the data-getting end in Python, JJ is writing the Glue 2 end in LISP. No schedule as yet.
- WAN tuning
Advanced Planning
Tasks
- CASTOR 2.1.15 implementation and testing
- Deployment of SRM 2.14
Staffing
- Chris out Friday and following Monday
- Castor on Call person next week: RA for next 2 weeks
New Actions
Existing Actions
- GS to ask Kashif re RAID firmware updates on d0t1 v2011 machines and whether there are other batches of machines that should be upgraded
- GP to work with BD to take over WAN tuning work developed by BD (aquilon / SCDB)
- GP to create WAN tuning wiki
- RA/AS new tool for monitoring srm db dups - the user type
- RA to get someone to code review his SRM_DB_DUPLICATES blatting script
- GS is there any documentation re handling broken CIPs (raised following CIP failure at weekend)
- GS Callout for CIP only in waking hours?
- RA to ensure the ATLAS consistency check is quattorised - Rob to talk to Andrew L
- RA to try stopping tapeserverd mid-migration to see if it breaks - ask Tim.
- RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
- GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
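
For the cleanlostfiles action above, one possible shape for the syslog change is sketched below. This is only an illustration under stated assumptions: the minutes do not say what language cleanlostfiles is written in or what should be recorded, so Python, the function names `build_audit_message` and `log_invocation`, and the message layout are all hypothetical. Only the standard library `syslog` and `getpass` modules are used.

```python
import getpass
import syslog


def build_audit_message(user, argv):
    """Format one audit line: who ran the tool and with which arguments.

    The layout (user=... args=...) is an assumption, not an agreed format.
    """
    return "user=%s args=%s" % (user, " ".join(argv) if argv else "(none)")


def log_invocation(argv, ident="cleanlostfiles"):
    """Send one audit line to the local syslog and return the text sent.

    'ident' is the tag syslog prepends to the line; the default here is
    simply the tool name mentioned in the minutes.
    """
    message = build_audit_message(getpass.getuser(), argv)
    # LOG_PID adds the process id; LOG_USER is the generic user facility.
    syslog.openlog(ident, syslog.LOG_PID, syslog.LOG_USER)
    syslog.syslog(syslog.LOG_INFO, message)
    syslog.closelog()
    return message
```

Calling `log_invocation(sys.argv[1:])` at the start of the tool would leave one line per run in the local syslog, which could then be searched by the `cleanlostfiles` tag to track usage.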
Completed Actions
- BD check if D drives have arrived for WLCG
- BD report draining issues to CERN
- BD MICE ticket - asking for a separate tape pool for d0t1 for Monte Carlo
