| 
| General updates |  
| 
Tuesday 29th October
 
 HEPiX takes place this week (timetable). Monday covered T1 reports and some Puppet discussion.
 Rapid job starts at some sites causing overloaded SEs?
 Jes forms will be going out to most sites in the coming day(s) for T2 HW grant applications
 Be aware that the UK eScience CA infrastructure services will be at risk on Mon 4th or Tue 5th November, and possibly into the 6th, due to electrical work.
 Minutes from Monday's WLCG ops meeting are available.
 The RAL AFS service will terminate this month.
 Tuesday 22nd October
 
 The move to VO based availability calculations - see files in this parallel reports directory.
 For those attending the WLCG workshop in November - please register as soon as possible.
 Renewing host certificates with 'odd' DNs (see TB-S thread)
 Sites are still being asked to support the backup GridPP VOMS servers and update their status here.
 There have been a series of NGI talks in preparation for H2020 that may be of interest.
 There was an ATLAS request to sites still using the old version of DQ2 clients. Could these sites please upgrade their local installation of DQ2 clients to version 2.4.1? (Email on 17/10/13).
 There has been an EGI ops portal change for VO cernatschool.org with status moving from: New to Production.
 The minos VO has been decommissioned.
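
 For the DQ2 client request above, a site could check whether its local installation is older than 2.4.1 with something like the following. This is only a minimal sketch: the helper names `parse_version` and `needs_upgrade` are made up for illustration, it assumes purely numeric dotted version strings, and how the installed version string is obtained from a local DQ2 installation is not shown here.

```python
# Hypothetical helpers: compare a dotted version string against the
# requested DQ2 client version (2.4.1, from the ATLAS email of 17/10/13).

def parse_version(text):
    """Turn a dotted version string such as '2.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, required="2.4.1"):
    """Return True when the installed version is older than the required one."""
    return parse_version(installed) < parse_version(required)


if __name__ == "__main__":
    # Example version strings only; these are not taken from any site.
    for version in ("2.3.0", "2.4.1", "2.10.0"):
        status = "upgrade needed" if needs_upgrade(version) else "ok"
        print(version, status)
```

 Comparing tuples of integers rather than raw strings avoids "2.10.0" sorting before "2.4.1".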
 |  
| WLCG Operations Coordination - Agendas |  
| 
Tuesday 29th October
 Tuesday 17th September
 
 The agenda of the next WLCG operations meeting is available here. The details of the agenda are not yet final. Participation of the Tier-1 contacts is strongly encouraged, and Tier-2 sites are also welcome to listen in and contribute (via Vidyo).
 Tuesday 2nd September
 
 Middleware
 New BDII release in the latest EMI-2/3 update, including better GLUE-2 support and security fixes. Sites should update all their BDII instances.
 New CVMFS version released for a security fix. Sites should upgrade or at least apply the hot fix in the above twiki.
 perfSONAR: sites should upgrade to the latest version, fixing many deployment problems 
 The end of support for dCache 1.9.12 has been postponed to September 30 due to a delay in releasing the SHA-2 compliant version in the dCache 2.2 series. 
 Consult https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
 SHA-2 
 Discussion mostly dedicated to the experiments' testing status. Atlas and LHCb have tested the services but not job submission yet. All experiments have been encouraged to test this.
 SL6
 T2 Done: 49/129 (Alice 11/39, Atlas 28/89, CMS 22/65, LHCb 13/45) -> 80/129 still to be done.
 HS06: Reminder that sites are requested to run HS06 benchmark and update the value in the BDII. Increased values might be discussed at the WLCG MB.
 EMI-3: voms-clients have been fixed and the latest version is in the PT repository but not in EMI-3 yet. Both CMS and Atlas work on DPM/dcache sites with this patch. (QMUL might want to give an update on Storm when they upgrade)
 UK status: Liverpool to be finished soon, Bham in downtime to upgrade this week, Bristol and Sussex should be done by 15/09/2013, RALPP by 20/09/2013 and QMUL, Lancaster, UCL by 30/09/2013.
 glexec
 55 sites have still to respond; they have attached the installation to the SL6 upgrade.
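
 The SL6 Tier-2 tally above can be re-checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the SL6 item; the variable and function names are made up for illustration. Note that the per-experiment counts add up to more than the overall totals, so sites supporting several experiments are presumably counted once per experiment.

```python
# Figures copied from the SL6 item above: (sites done, sites in total).
overall_done, overall_total = 49, 129
per_experiment = {
    "Alice": (11, 39),
    "Atlas": (28, 89),
    "CMS": (22, 65),
    "LHCb": (13, 45),
}


def percent(done, total):
    """Percentage of sites that have completed the SL6 migration."""
    return 100.0 * done / total


# Sites still to migrate, matching the "80/129" quoted in the bulletin.
remaining = overall_total - overall_done

if __name__ == "__main__":
    print("Overall: %d/%d done (%.1f%%), %d still to do"
          % (overall_done, overall_total,
             percent(overall_done, overall_total), remaining))
    for name, (done, total) in per_experiment.items():
        print("%-5s %d/%d done (%.1f%%)"
              % (name, done, total, percent(done, total)))
```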
 Monday 12th August
 
 There have been no recent meetings. The next is on 29th August.
 
 Sites monitoring requirements: SUM tests not representing the real experiment status for example.   
 |  
| Tier-1 - Status Page |  
| 
Tuesday 29th October
 
 Reminder that the RAL AFS service will be stopped at the end of October.
 We have the current two batch farms running, each with around 50% of the capacity. All worker nodes in the Condor farm have been upgraded to use CVMFS v2.1.15-1. We will continue to run the two farms for another week when the Torque/Maui farm will be decommissioned and its WNs moved to the Condor farm.
 Last Wednesday (23rd Oct) the data uplink from the Tier1 was doubled. There is now a 20Gbit link to the next router, from where there is the existing 10Gbit link to Janet and the separate 10Gbit OPN link to CERN.
 There is an intervention on the UPS in the computer building next Tuesday/Wednesday 5/6 November. The plan is to keep core services up, although there will need to be some breaks in the LFC & FTS services at the start and end of the Tuesday (5th). Castor will be down the whole day on Tuesday 5th. Batch services will be down at least for the Tuesday, but this may extend to the Wednesday.
 |  
| Storage & Data Management - Agendas/Minutes |  
| 
Tuesday 8th October
 
 The DPM workshop agenda and registration page will appear here.
 Monday 30th September
 
 A DPM workshop is being organised in Edinburgh for 13th December. GridPP PMB anticipated covering travel for around 10 UK sysadmins for this event. Interest should be indicated during the storage group meeting.
 Tuesday 17th September
 
 Perhaps someone could summarise the "Dark Data identification tools" thread on TB-Support?
 
 |  
 
| Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06 |  
| 
Tuesday 13th August
 Tuesday 23rd July
 
 Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
 There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.
 Tuesday 30th April
 
 A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to make more use of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s.)
 Tuesday 12th March
 
 APEL publishing stopped for Lancaster, QMUL and ECDF
 Tuesday 12th February
 
 SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
 An update of the metrics page has been requested. 
 |  
| Documentation - KeyDocs |  
| 
See the worst KeyDocs list for documents needing review now and the names of the responsible people.
 Next review on 7th November.
 Tuesday 1st October
 
 The approved VOs page has been updated with the newest data from the operations portal. Note that the VOMS records for LondonGrid now contain some alternative VOMS servers. The migration plan for use of these backup servers is now documented here.
 Tuesday 17th September 
 Tuesday 3 September 2013
 
 Proposal for "Instant UI", with the aim to produce a suite of documentation and software that will enable a new user to set up a UI and join the grid with the minimum of hassle. The documentation will show how to administer a UI that can be used to submit jobs and retrieve output for a given set of users belonging to a given set of VOs. "Instant UI" is currently in consultation phase with the GridPP admin community.
 |  
| Interoperation - EGI ops agendas |  
| 
Monday 28th October
 
 UMD-2 - no news really - support/users dwindling - security support to end by the end of Apr/2014 - bug with BDII; fix coming soon.
 
 ARC - Major release coming in November. 
 
 UMD-3 Cream in test  - Slurm plugin (becoming mainstream?) - also Torque, Blah plugin - Storm and VOMS server and client bug fixes
 
 DMSU bug - affecting retrieval of output file from Cream (EMI-2 and EMI-3 UI affected)
 
 xroot issue for dCache - J. Pina (SA1.3 /LIP): "dcache  2.2.17 does not support xrootd-backport, which is required for running a CMS site on dcache 2.2." 
 
 A new probe for Glue Validator alarms - sites failing it now in this view. See also this document - not clear if list is complete or accurate as status of the probe was not clarified - complaints from sites about tight schedule due to current effort dedicated to SHA-2 and SL6 - to be decided in November.
 
 Next meeting: Nov 1 - changes to timeline? Start Jan and possible deadline in 2 months. Next meeting: Nov 11.
 Tuesday 15th October
 
 Topics are being gathered for the next EGI ops meeting on Thursday October 24th at 10:00 Amsterdam time.
 David/Raul are representing the UK
 Monday 7th October
 
 The ops meeting today covered: news from URT; staged rollout updates; UMD updates; DMSU updates (WMS problems at CNAF); ARGUS connection problems and SHA-2 update.
 Monday 30th September
 
 There was an EGI ops meeting on 23rd September. See the agenda for more details.
 Monday 16th September
 
 The next meeting takes place on 23rd September at 13:00 (UK time).
 UMD 3.2.0 was released last week. See the release page for more information.
 Monday 2nd September
 
 Yesterday's agenda. Attended by David and Raul.
 gLite support calendar.
 
 |  
| Monitoring - Links MyWLCG |  
| 
Monday 30th September
 
 David summarised the UK sites' position on Nagios in an email last week as:
 There is a desire for a monitoring solution that gives automatic notifications and links to further information, and doesn't require additional webpages (which describes Nagios). We noted that Nagios could be used to import central Nagios tests and repurpose them for local testing.
 In addition, it would be useful if the further details could include details of the testing execution commands (even including the test itself) for local diagnosis.
 We wondered whether (and where) there might be common ground with the WLCG Nagios project - while this may have been discussed, it would be useful to clarify this.
 It's important to have a clear and documented messaging/transport layer for any solution that's decided on, for integration with future monitoring solutions.
 Tuesday 23rd July
 Tuesday 18th June
 
 David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
 
 Glasgow dashboard now packaged and can be downloaded here. 
 |  
| On-duty - Dashboard ROD rota |  
| 
Tuesday 22nd October
 
 There seem to be a number of sites struggling to publish, but there already seem to be quite a number of GGUS tickets out there.
 Tuesday 15th October
 
 Just two sites (RAL Tier1 and ECDF) not in downtime and with SHA-2 alarms.  
 RALPP has a ticket for the dCache version which they plan 
 Some alarms earlier in the week due to BDII problems
 A ticket against Sussex has hit the one month resolution deadline and been escalated. This has also resulted in a "ROD not performing properly" ticket to the UK NGI.
 The rota for coming months needs to be agreed.
 |  
| Rollout Status WLCG Baseline |  
| 
Tuesday 29th Oct
Yesterday the first Staged Rollout request (for the CREAM CE) in months came through.
I've updated the Stage of the Nation page.
 Tuesday 8th Oct
There have been updates to EMI2 and 3 yesterday, but no new request for Staged Rollout.
There is a problem with dcap-libs: [GGUS 97805] 
 Tuesday 17th September
 
 Chris sent in a report for Storm. 
 References
 
 |  
| Security - Incident Procedure Policies Rota |  
| 
Tuesday 29th October
 
 There was a team meeting on Friday 25th.
 A couple of critical warnings are appearing in Pakiti and being followed up.
 Tuesday 8th October
 
 ARGUS setup for UK
 ARGUS configuration (see Chris's email)
 Tuesday 17th September
 
 More information on the EGI/PRACE/EUDAT Joint Security Training event mentioned last week is now available.
 
 |  
 | 
| Services - PerfSonar dashboard | GridPP VOMS |  
| 
Tuesday 1st October
 
 PerfSONAR latency hosts configured to use the WLCG meshes should now have a traceroute measurement archive (MA) accessible from the GUI under 'Service Graphs' --> 'Traceroute'. Here is an example.
 Tuesday 17th September
 
 Upgrading/re-installing hosts to v3.3.1/mesh is only making slow progress.
 There is a new view of the status between sites.
 An outage at Manchester due to central switch maintenance means that VOMS is not going to be contactable for a period this morning. It is clear that we need the backup VOMS instances fully available to VOs - please can someone take a lead?
 |  
| Tickets |  
| 
Tuesday 29th October 2013, 10.30 GMT
 I was off the clock yesterday, so forgot to send out the ticket update. It's a bit late now, but here are the ones that catch my eye:
 NGI/SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=97941
The ticket to the NGI over Sussex ticket 97139 needs some soothing, especially considering that the offending ticket is now closed. In progress (21/10).
 RAL
https://ggus.eu/ws/ticket_info.php?ticket=97868
Catalin has asked T2K some questions about what/how much they're going to put into their pending cvmfs area. No reply from T2K yet. Waiting for reply (21/10).
 I notice a bunch of minor updates to the glexec tickets; the general theme is that things are progressing but sorting out the SL6 bugbears has taken priority. Also I see a few more LHCb tickets this last week or so, probably as they run into problems with new SL6 queues. I've always found LHCb jobs to be excellent canaries (although I don't think they'd appreciate being thought of as such - sorry guys!).
 
 
 |  
| Tools - MyEGI Nagios |  
| 
Monday 30th September
 
 Ewan has put together a slightly modified WLCG VO box, but the effect is of a UI that takes gsi ssh logins from people in one particular VO, but then can be used as a UI for other VOs once you're logged in. The idea is that anyone who would need access to a central UI machine (so, mostly not people in PP depts.) would join a special-purpose VO. See Ewan's TB-SUPPORT email on 23rd September for more details.
 Monday 2nd September
 
 Intermittent Nagios errors -> Imperial WMS and all the jobs going through it were failing with ‘no compatible error’. Some reports of ongoing issues. What is the direct impact?
 MyEGI and gstat were also down last week.
 Jens is testing SHA-2 compliance of components. The version of gridsite on the GridPP website is not compliant but SHA-2 will be supported with a move to a new server (when?).
 |  
| VOs - GridPP VOMS VO IDs Approved VO table |  
| 
Monday 21st October 2013
 Monday 7th October 2013
 
 CVMFS server for hyperk.org still outstanding
 LFC Webdav still awaiting port opening
 HyperK - progress - expect to run significant number of jobs soon. 
 Monday 2nd September
 
 The next quarterly Tier 1 allocation/resourcing meeting is scheduled for Wednesday 18th September (after the weekly T1 meeting). The hardware requirements and fair-shares for the period October-December 2013 will be reviewed, and the meeting will also look ahead over the next 12 months. Can all experiments/projects please let Pete G have any updates or requests to these numbers by Friday 13th September?
 Monday 19 August
 
 EPIC 
 Support requested at Tier-1
 Any other sites prepared to support them?
 
 Catalogue synchronisation - Biomed working on it.  
 Monday 12 August
 
 HyperK.org
 VOMS servers set up (Manchester, Oxford, Imperial)
 VOID card - stalled on a homepage. 
 WMS set up (Imperial) - awaiting Glasgow, RAL
 Site set up (QMUL)
 LFC - in progress
 CVMFS - considering
 
 SNO+
 Dirac set up for some CEs
 
 ngs.ac.uk VO - any reason  to keep it?
 
 Software areas for SL6 
 Are we keeping the same areas as SL5?
 What about the software tags?
 Push CVMFS?
 |  |