Nagios
Nagios is a network / host monitoring package available under the GPL. See either the Wikipedia summary or the product homepage for more details.
GridPP operates a UK-wide Nagios instance; details are at http://www.gridpp.ac.uk/wiki/UKI_Regional_Nagios
Although not primarily designed as one of the Monitoring_Tools_for_LCG, it can alert administrators to failing services and potentially restart them, as well as provide availability statistics.
Monitoring Plugins
These are documented on a separate page.
Remote Hosts
Because Nagios runs on a central server, it can only examine the state of a remote machine through something that is reachable over the network. It can therefore run any monitor on localhost, but for remote hosts it is restricted to the following:
- Network services (e.g. check_ssh, which tests whether an sshd service is answering on the target host)
- 'Polled' local scripts that send their results back over a secure pipe (NRPE)
- 'Pushed' results of checks run on the remote host, sent back to the Nagios server as passive results (NSCA)
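The difference between the pushed and polled routes can be sketched in shell. This is a minimal illustration rather than a recommended setup: the helper name passive_result, the host node001, the server nagios.example.org and the config path are all assumptions made for the example. The only fixed part is the tab-separated layout that send_nsca reads on standard input.

```shell
#!/bin/sh
# Format one passive service-check result in the tab-separated layout
# that send_nsca reads on stdin:
#   host<TAB>service<TAB>return_code<TAB>plugin_output
# Return codes follow the plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Pushed (NSCA), run on the remote host. The server name and config
# path are assumptions:
#   passive_result node001 sshd 0 "SSH OK" |
#       send_nsca -H nagios.example.org -c /etc/nagios/send_nsca.cfg
#
# Polled (NRPE), run from the Nagios server. check_load must be
# defined as a command in the remote host's nrpe.cfg:
#   check_nrpe -H node001 -c check_load
```

For a pushed result to be accepted, the Nagios server also needs a service with the same host and service names defined and passive checks enabled for it.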
Configuration Tips
- See what others are doing, e.g. RALPP_Work_List_Nagios
- Generate configuration automatically to make repetitive groups simple. For example, Andrew Elwell has a set of shell scripts, one for each type of node (worker, server, disk), that contain loops such as:
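CFG=workers.cfg   # output file; this name is only an example
for i in $(seq 1 140); do
    h=$(printf "%03d" "$i")   # zero-pad the node number: 1 -> 001
    cat <<EOF >> "$CFG"
define host {
    host_name node$h
    alias     Worker Node $h
    address   10.141.0.$i
    use       wn_template
}
EOF
done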
Rather than defining each service on each node individually, you can put the hosts in a hostgroup through their template and then define the service once for the whole group:
define hostgroup{
alias Worker Nodes
hostgroup_name workernodes
}
define host{
name wn_template
use linux-server
hostgroups workernodes
register 0
}
define service{
hostgroup_name workernodes
service_description sshd
check_command check_ssh
servicegroups sshservers
use local-service
}
- Group related services together using servicegroups (as with sshservers above)
- If you already restrict access to the web server that Nagios runs under (htaccess or SSL/X.509), you can set the authorized_for_* directives in cgi.cfg to * so that any authenticated user is allowed; Nagios then takes the user name from $REMOTE_USER
- An example SSL configuration for Apache 2, which also shows how to apply basic certificate ACLs to the Nagios location in the Apache config:
SSLEngine on
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
SSLCertificateFile /etc/apache2/ssl/nagios-hostcert.pem
SSLCertificateKeyFile /etc/apache2/ssl/nagios-hostkey.pem
SSLCACertificatePath /etc/grid-security/certificates
SSLCACertificateFile /etc/apache2/ssl/cacert.crt
SSLOptions +ExportCertData +CompatEnvVars +StdEnvVars
SSLVerifyClient require
SSLVerifyDepth 2
SSLUserName SSL_CLIENT_S_DN
<Location /nagios>
SSLRequire %{SSL_CLIENT_S_DN} eq "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=colin morey" \
or %{SSL_CLIENT_S_DN} eq "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=Someone Else"
</Location>
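The host-generation loop in the tips above can be wrapped in a function so that one script covers each node type. The following is a sketch under stated assumptions, not the scripts referred to above: the function names gen_hosts and gen_servicegroup, their argument order and the output file name are all hypothetical. The second function emits the servicegroup that the sshd service definition refers to.

```shell
#!/bin/sh
# gen_hosts PREFIX LABEL FIRST LAST TEMPLATE NETWORK
# Print one Nagios host definition per node to stdout.
# NETWORK is the first three octets; the node number becomes the last
# octet, so LAST must not exceed 254.
gen_hosts() {
    prefix=$1 label=$2 first=$3 last=$4 template=$5 network=$6
    i=$first
    while [ "$i" -le "$last" ]; do
        h=$(printf '%03d' "$i")   # zero-pad: 1 -> 001
        cat <<EOF
define host {
    host_name $prefix$h
    alias     $label $h
    address   $network.$i
    use       $template
}
EOF
        i=$((i + 1))
    done
}

# gen_servicegroup NAME ALIAS
# Print the servicegroup definition that services can join through
# their "servicegroups" directive.
gen_servicegroup() {
    cat <<EOF
define servicegroup {
    servicegroup_name $1
    alias             $2
}
EOF
}

# Example (the file name is an assumption):
#   gen_hosts node "Worker Node" 1 140 wn_template 10.141.0 > workers.cfg
#   gen_servicegroup sshservers "SSH Servers" >> workers.cfg
```

After regenerating the files, running nagios -v against the main configuration file checks the result before the daemon is reloaded.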
Notifications
By default Nagios comes with email notifications, but it can easily be extended to notify via pager, SMS or even Jabber.
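Extending notifications usually means pointing a notification command at a small script. The sketch below only shows the shape of such a script: the function name format_alert, the script path and the send_sms program are hypothetical, and the real delivery tool depends on the pager, SMS or Jabber gateway in use.

```shell
#!/bin/sh
# format_alert TYPE HOST SERVICE STATE OUTPUT
# Build a one-line message short enough for a pager or SMS.
format_alert() {
    printf '%s: %s/%s is %s (%s)\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical wiring. In a Nagios command definition the standard
# macros are passed as arguments to a script containing this function:
#   command_line /usr/local/bin/notify.sh "$NOTIFICATIONTYPE$" \
#       "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$SERVICEOUTPUT$"
# and the script hands the message to whatever sender is installed:
#   format_alert "$@" | send_sms "$CONTACT_NUMBER"   # send_sms is an assumption
```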