RAL Tier1 weekly operations castor 19/10/2009
From GridPP Wiki
Summary of Previous Week
- CASTOR F2F at CERN (Chris, Matt)
- Continuing to deal with fallout from the Oracle disk controller crash, specifically the rollback of the databases (All)
- Investigation into exactly what happened (All, DB Team)
- Investigating the consequences of re-using NS uniqueid (Chris, Matt, CERN team)
- Producing lists of lost and at-risk files (Chris, Matt)
- Gathering information for post mortem (All)
- Increased NS uniqueid counter in NS database (All, DB Team)
 
- Deployed one new disk server for LHCb (Chris)
- Tweaked database backups to try out a grandfather/father/son cycle (Cheney)
- Continued with build of new db server cdbe07 (Cheney)
- Tweaked backups of redo logs to dmf for Pluto (Cheney)
- Added bulk log disk array for Pluto redo log archive (Cheney)
- Fixed cdbe02 and configured to pick up Overland array (Cheney)
- Shifted EMC array to run on a different PDU but the same power supply (Cheney)
- Building tape robot controller to swap out buxton (Cheney)
Developments for this week
- Set up 2.1.8 on repack server with Puppet (Chris)
- Working on Puppet manifest for polymorphic central servers (Chris)
- Testing various combinations of EMC kit versus power supply (Cheney)
- Regenerate Nagios config for disk servers (Cheney)
- Build spare tape robot controller (Cheney)
- Build replacement db server (Cheney)
- Techwatch newsletter (Cheney)
- Making ATLAS file lists for comparison to LFC (Matt)
- Contributing to incident PMs (Matt)
Ongoing
- SRM 2.8-1 deployment on Gen, LHCb, CMS (Shaun)
- CastorMon monitoring graphs for Gen instance (Brian)
- Black and White list tests (Chris)
- Disaster recovery document (Matt)
Operations Issues
- Possible lost data resulting from re-using NS uniqueids (TBC)
- Problems with the DNS server (chiton) affected all CASTOR instances for 4-5 hours
Blocking issues
- Problems with the Ganglia check on the Gen instance are delaying work on monitoring (in hand)
Planned, Scheduled and Cancelled Down Times
none
Changes to Production Milestones
none
Advanced Planning
- Black and White lists? (delayed until it is required on a 'per-instance' basis)
- Improve resiliency to central services (This year)
Staffing
- Brian A/L
- Tim at LTUG (Mon-Wed)
- Shaun away (?)
- CASTOR on-call person: Chris
