RAL Tier1 Workload Management
Important note: As of 1 December 2008, RAL Tier1 no longer offers an lcg-RB service (the lcgrb01, lcgrb02 and lcgrb03.gridpp.rl.ac.uk servers have been decommissioned). The information below is out of date and will be replaced by proper gLite WMS/LB information in early January 2009. In the meantime, users may submit jobs to lcgwms01 and lcgwms02.gridpp.rl.ac.uk using the glite-wms-job-* tools.
Contents
Service Endpoints
The RAL Tier1 runs an LCG Workload Management System (Resource Broker) on three machines: lcgrb01.gridpp.rl.ac.uk, lcgrb02.gridpp.rl.ac.uk and lcgrb03.gridpp.rl.ac.uk.
The list of VOs supported by the RBs can be obtained with:
ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
-b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=ResourceBroker)' \
GlueServiceAccessControlRule
As of 15 October 2008, this returned:
GlueServiceAccessControlRule: atlas
GlueServiceAccessControlRule: alice
GlueServiceAccessControlRule: lhcb
GlueServiceAccessControlRule: cms
GlueServiceAccessControlRule: biomed
GlueServiceAccessControlRule: zeus
GlueServiceAccessControlRule: hone
GlueServiceAccessControlRule: cdf
GlueServiceAccessControlRule: dzero
GlueServiceAccessControlRule: babar
GlueServiceAccessControlRule: pheno
GlueServiceAccessControlRule: t2k
GlueServiceAccessControlRule: esr
GlueServiceAccessControlRule: ilc
GlueServiceAccessControlRule: magic
GlueServiceAccessControlRule: minos.vo.gridpp.ac.uk
GlueServiceAccessControlRule: mice
GlueServiceAccessControlRule: dteam
GlueServiceAccessControlRule: fusion
GlueServiceAccessControlRule: geant4
GlueServiceAccessControlRule: cedar
GlueServiceAccessControlRule: manmace
GlueServiceAccessControlRule: gridpp
GlueServiceAccessControlRule: ngs.ac.uk
GlueServiceAccessControlRule: camont
GlueServiceAccessControlRule: totalep
GlueServiceAccessControlRule: vo.southgrid.ac.uk
GlueServiceAccessControlRule: vo.northgrid.ac.uk
GlueServiceAccessControlRule: vo.scotgrid.ac.uk
GlueServiceAccessControlRule: supernemo.vo.eu-egee.org
GlueServiceAccessControlRule: na48
GlueServiceAccessControlRule: vo.nanocmos.ac.uk
GlueServiceAccessControlRule: vo.londongrid.ac.uk
GlueServiceAccessControlRule: ops
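To check whether a particular VO is in this list without reading through the full output, the ldapsearch result can be filtered with grep. This is a sketch; the VO name "atlas" is just an example to substitute with your own:

```shell
# Query the site BDII for the RBs' access-control rules and check for one VO.
# Substitute your own VO name for "atlas".
VO=atlas
ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
  -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=ResourceBroker)' \
  GlueServiceAccessControlRule \
  | grep -x "GlueServiceAccessControlRule: ${VO}" \
  && echo "VO ${VO} is supported" \
  || echo "VO ${VO} is not supported"
```

The `-x` flag to grep matches whole lines only, so e.g. `atlas` does not accidentally match a longer VO name containing that string.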
Basic Usage
A user interface can be configured to use any of these Resource Brokers (in the examples below 'lcgrb01' can be replaced with 'lcgrb02' or 'lcgrb03'):
# edg_wl_ui.conf
[
VirtualOrganisation = "dteam";
NSAddresses = "lcgrb01.gridpp.rl.ac.uk:7772";
LBAddresses = "lcgrb01.gridpp.rl.ac.uk:9000";
MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"
]
# edg_wl_ui_cmd_var.conf
[
rank = - other.GlueCEStateEstimatedResponseTime;
requirements = other.GlueCEStateStatus == "Production";
RetryCount = 3;
ErrorStorage = "/tmp";
OutputStorage = "/tmp/jobOutput";
ListenerPort = 44000;
ListenerStorage = "/tmp";
LoggingTimeout = 30;
LoggingSyncTimeout = 30;
LoggingDestination = "lcgrb01.gridpp.rl.ac.uk:9002";
NSLoggerLevel = 0;
DefaultLogInfoLevel = 0;
DefaultStatusLevel = 0;
DefaultVo = "unspecified";
]
Finally, list the resources matching a job description with
$ edg-job-list-match --config edg_wl_ui.conf \
--config-vo edg_wl_ui_cmd_var.conf HelloWorld.jdl
and submit the job with
$ edg-job-submit --config edg_wl_ui.conf \
--config-vo edg_wl_ui_cmd_var.conf HelloWorld.jdl
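The commands above assume a JDL file describing the job. A minimal HelloWorld.jdl might look like the sketch below (the filename and contents are illustrative, not a file distributed with the RB):

```jdl
Executable    = "/bin/echo";
Arguments     = "Hello World";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
```

The OutputSandbox attribute lists the files to be transferred back to the user when the output is retrieved.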
Service Monitoring
- Ganglia Host Level Monitoring lcgrb01
- Ganglia Host Level Monitoring lcgrb02
- Ganglia Host Level Monitoring lcgrb03
The Ganglia plots also indicate the number of jobs currently held in the Logging and Bookkeeping (L&B) service in each state.
| Job State | Plot Name | Description |
|-----------|-----------|-------------|
| ABORTED | jobs_aborted | Aborted by system (at any stage). |
| CANCELLED | jobs_cancelled | Cancelled by user. |
| CLEARED | jobs_cleared | Output transferred back to the user and freed. |
| DONE | jobs_done | Execution finished, output is available. |
| READY | jobs_ready | Matching resources found. |
| RUNNING | jobs_running | Executable is running. |
| SCHEDULED | jobs_scheduled | Accepted by LRMS queue. |
| SUBMITTED | jobs_submitted | Submitted by the user via the User Interface. |
| WAITING | jobs_waiting | Accepted by WMS, waiting for resource allocation. |
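The state of an individual job can be followed from the UI with edg-job-status. As a sketch, the current state can be pulled out of saved status output like this; note the sample text below is an assumption about the output layout, not captured RB output, so the sed pattern may need adjusting:

```shell
# Illustrative only: extract the "Current Status" line from saved
# edg-job-status output. The sample below mimics the typical layout.
sample='*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/abc123
Current Status:     Running
*************************************************************'
state=$(printf '%s\n' "$sample" | sed -n 's/^Current Status:[[:space:]]*//p')
echo "$state"
```

The extracted value corresponds to one of the job states in the table above.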
- RB/WMS Monitoring (thanks to Yvan Calas - CERN)
- RB/WMS Monitoring Tool HowTo
Alarms:
- If FD (the number of file descriptors opened by the edg-wl-log_monitor process) turns red (i.e. grows too large), apply the following procedure:
1. Edit /etc/cron.d/edg-wl-check-daemons and comment out the cron job.
2. Then run:
----------------------------------------------------------------------
/etc/init.d/edg-wl-lm stop
cd /var/edgwl/logmonitor/CondorG.log/
find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
cd /
/etc/init.d/edg-wl-lm start
----------------------------------------------------------------------
3. Edit /etc/cron.d/edg-wl-check-daemons and uncomment the cron job.
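The log-rotation step above can be rehearsed safely in a scratch directory before running it on the RB itself. The sketch below uses illustrative paths (the real directory on the RB is /var/edgwl/logmonitor/CondorG.log/) and fakes an old log file's timestamp:

```shell
# Rehearse the CondorG log rotation in a throwaway directory.
scratch=$(mktemp -d)
mkdir "$scratch/recycle"
touch -t 202001010000 "$scratch/CondorG.old.log"   # simulate a >30-day-old log
touch "$scratch/CondorG.new.log"                   # recent log, must stay put
cd "$scratch"
# Same find invocation as in the procedure: move logs older than 30 days.
find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
ls recycle/   # the old log ends up here; the recent one is untouched
```

Running the move through find's `-mtime +30` test, rather than moving everything, keeps the logs the daemon may still be writing to in place.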