RAL Tier1 Workload Management
Important note: As of 1 December 2008, RAL Tier1 no longer offers an lcg-RB service (the lcgrb01, lcgrb02 and lcgrb03.gridpp.rl.ac.uk servers have been decommissioned). The information below is out of date and will be replaced by proper glite-WMSLB information in early January 2009. In the meantime, users can submit jobs to lcgwms01 and lcgwms02.gridpp.rl.ac.uk using the glite-wms-job-* tools.
Service Endpoints
The RAL Tier1 runs an LCG Workload Management System, or Resource Broker (RB), on three machines: lcgrb01.gridpp.rl.ac.uk, lcgrb02.gridpp.rl.ac.uk and lcgrb03.gridpp.rl.ac.uk.
The list of VOs that the RBs support can be obtained with:
  ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
      -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=ResourceBroker)' \
      GlueServiceAccessControlRule
As of 15 October 2008, the query returned:
  GlueServiceAccessControlRule: atlas
  GlueServiceAccessControlRule: alice
  GlueServiceAccessControlRule: lhcb
  GlueServiceAccessControlRule: cms
  GlueServiceAccessControlRule: biomed
  GlueServiceAccessControlRule: zeus
  GlueServiceAccessControlRule: hone
  GlueServiceAccessControlRule: cdf
  GlueServiceAccessControlRule: dzero
  GlueServiceAccessControlRule: babar
  GlueServiceAccessControlRule: pheno
  GlueServiceAccessControlRule: t2k
  GlueServiceAccessControlRule: esr
  GlueServiceAccessControlRule: ilc
  GlueServiceAccessControlRule: magic
  GlueServiceAccessControlRule: minos.vo.gridpp.ac.uk
  GlueServiceAccessControlRule: mice
  GlueServiceAccessControlRule: dteam
  GlueServiceAccessControlRule: fusion
  GlueServiceAccessControlRule: geant4
  GlueServiceAccessControlRule: cedar
  GlueServiceAccessControlRule: manmace
  GlueServiceAccessControlRule: gridpp
  GlueServiceAccessControlRule: ngs.ac.uk
  GlueServiceAccessControlRule: camont
  GlueServiceAccessControlRule: totalep
  GlueServiceAccessControlRule: vo.southgrid.ac.uk
  GlueServiceAccessControlRule: vo.northgrid.ac.uk
  GlueServiceAccessControlRule: vo.scotgrid.ac.uk
  GlueServiceAccessControlRule: supernemo.vo.eu-egee.org
  GlueServiceAccessControlRule: na48
  GlueServiceAccessControlRule: vo.nanocmos.ac.uk
  GlueServiceAccessControlRule: vo.londongrid.ac.uk
  GlueServiceAccessControlRule: ops
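Because several RBs publish the same rules, the raw ldapsearch output can contain repeated entries mixed with other LDIF lines. The following is a minimal sketch of one way to reduce it to a sorted list of unique VO names. The function name `extract_vos` is a hypothetical helper, not part of any grid toolkit, and the sample input is illustrative only.

```shell
#!/bin/sh
# extract_vos: read ldapsearch output on stdin and print the unique VO names.
# (Hypothetical helper name; it only post-processes text.)
extract_vos() {
    # Keep only the access control rule lines, strip the attribute name,
    # then sort and remove duplicates.
    sed -n 's/^GlueServiceAccessControlRule:[[:space:]]*//p' | sort -u
}

# Demonstration with a few illustrative lines of input:
printf '%s\n' \
    '# extended LDIF' \
    'GlueServiceAccessControlRule: dteam' \
    'GlueServiceAccessControlRule: atlas' \
    'GlueServiceAccessControlRule: dteam' | extract_vos
```

In practice the ldapsearch command shown above would be piped into `extract_vos` in place of the sample `printf` input.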
Basic Usage
A user interface can be configured to use any of these Resource Brokers (in the example below, 'lcgrb01' can be replaced with 'lcgrb02' or 'lcgrb03'):
 # edg_wl_ui.conf
 [
    VirtualOrganisation = "dteam";
    NSAddresses = "lcgrb01.gridpp.rl.ac.uk:7772";
    LBAddresses = "lcgrb01.gridpp.rl.ac.uk:9000";
    MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"
 ]
 # edg_wl_ui_cmd_var.conf
 [
    rank = - other.GlueCEStateEstimatedResponseTime;
    requirements = other.GlueCEStateStatus == "Production";
    RetryCount = 3; 
    ErrorStorage = "/tmp";
    OutputStorage = "/tmp/jobOutput";
    ListenerPort = 44000;
    ListenerStorage = "/tmp";
    LoggingTimeout = 30;
    LoggingSyncTimeout = 30;
    LoggingDestination = "lcgrb01.gridpp.rl.ac.uk:9002";
    NSLoggerLevel = 0;
    DefaultLogInfoLevel = 0;
    DefaultStatusLevel = 0;
    DefaultVo = "unspecified";
 ]
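Finally, check which resources match the job with edg-job-list-match (to actually submit the job, use edg-job-submit with the same options):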
 $ edg-job-list-match --config edg_wl_ui_cmd_var.conf \
       --config-vo edg_wl_ui.conf  HelloWorld.jdl
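The HelloWorld.jdl file used in the command above is not shown on this page. As a rough sketch, assuming a trivial job that runs /bin/echo and returns its output, such a file could be created as follows. The file contents, including the output file names hello.out and hello.err, are illustrative assumptions and not the original file.

```shell
#!/bin/sh
# Write a minimal JDL description to HelloWorld.jdl in the current directory.
# The job definition is an illustrative assumption, not the original file.
cat > HelloWorld.jdl <<'EOF'
Executable = "/bin/echo";
Arguments = "Hello World";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
EOF
```

The OutputSandbox attribute lists the files to be returned to the user once the job has finished.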
Service Monitoring
- Ganglia Host Level Monitoring lcgrb01
- Ganglia Host Level Monitoring lcgrb02
- Ganglia Host Level Monitoring lcgrb03
 
The Ganglia plots also indicate the number of jobs currently held in each state within the logging and bookkeeping service.
| Job State | Plot Name | Description | 
| ABORTED | jobs_aborted | Aborted by system (at any stage). | 
| CANCELLED | jobs_cancelled | Cancelled by user. | 
| CLEARED | jobs_cleared | Output transferred back to user and freed. | 
| DONE | jobs_done | Execution finished, output is available. | 
| READY | jobs_ready | Matching resources found. | 
| RUNNING | jobs_running | Executable is running. | 
| SCHEDULED | jobs_scheduled | Accepted by LRMS queue. | 
| SUBMITTED | jobs_submitted | Entered by the user to the User Interface. | 
| WAITING | jobs_waiting | Accepted by WMS, waiting for resource allocation. | 
- RB/WMS Monitoring (thanks to Yvan Calas - CERN)
- RB/WMS Monitoring Tool HowTo
 
Alarms:
- If FD (the number of file descriptors opened by the edg-wl-log_monitor process) turns red (i.e. is too large), follow this procedure:
1. Edit /etc/cron.d/edg-wl-check-daemons and comment out the cron job.
2. Stop the log monitor, move the CondorG logs older than 30 days into the recycle directory, and start the log monitor again:
----------------------------------------------------------------------
/etc/init.d/edg-wl-lm stop
cd /var/edgwl/logmonitor/CondorG.log/
find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
cd /
/etc/init.d/edg-wl-lm start
----------------------------------------------------------------------
3. Edit /etc/cron.d/edg-wl-check-daemons and uncomment the cron job.
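The log-moving part of step 2 can be wrapped in a small function so that it can be tried on a scratch directory first. This is a minimal sketch, not the procedure used at RAL: the function name `recycle_old_logs` is hypothetical, it creates the recycle directory if it is missing (the original commands assume it exists), and it relies on the `-maxdepth` option found in GNU and BSD find.

```shell
#!/bin/sh
# recycle_old_logs DIR [DAYS]
# Move CondorG.*.log files in DIR that were last modified more than DAYS
# days ago (default 30) into DIR/recycle, printing each file as it is moved.
# (Hypothetical helper name; mirrors the find command in step 2.)
recycle_old_logs() {
    dir=$1
    days=${2:-30}
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Assumption: create the recycle directory if it is not already there.
    mkdir -p "$dir/recycle" || return 1
    # -maxdepth 1 stops find from descending into the recycle directory.
    find "$dir" -maxdepth 1 -type f -name 'CondorG.*.log' \
        -mtime +"$days" -print -exec mv {} "$dir/recycle/" \;
}
```

It would be called between stopping and starting edg-wl-lm, for example as `recycle_old_logs /var/edgwl/logmonitor/CondorG.log`.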