RALPP Work List Nagios
Progress on installing Nagios monitoring for the RALPP Tier 2
04/07/2006
Installed nagios rpms, set up very basic config
Installed nagios and nagios-plugins on heplnx182
Configured http access with:
htpasswd -c /etc/nagios/htpasswd.users nagiosadmin
When I tried starting the nagios service it complained about problems with the config file. I had to edit /etc/nagios/nagios.cfg and comment out all the cfg_file entries other than minimal.cfg.
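As a rough illustration, the relevant part of /etc/nagios/nagios.cfg might look like the fragment below after that edit. The names of the commented-out files are assumptions based on the sample files shipped with Nagios 2.x and may differ in the rpm used here.

```cfg
# /etc/nagios/nagios.cfg (fragment)
# Only minimal.cfg is left active. The commented-out file names are
# assumed examples and are not taken from this installation.
#cfg_file=/etc/nagios/checkcommands.cfg
#cfg_file=/etc/nagios/misccommands.cfg
cfg_file=/etc/nagios/minimal.cfg
```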
The nagios service then started and I could log into the web interface (after making apache reload its config) but I couldn't see any info on the one host in the config (localhost).
Eventually discovered that I had to edit /etc/nagios/cgi.cfg to give the userid I'd just set up (nagiosadmin) permission to access various bits of the CGI. Unscientifically enabled everything in sight. Now I can see the status of localhost.
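For reference, "everything in sight" in cgi.cfg probably corresponds to something like the following sketch. The directive names are the standard CGI authorization options; which of them were actually changed here is an assumption.

```cfg
# /etc/nagios/cgi.cfg (fragment)
# Sketch only: grants the nagiosadmin user access to every CGI view
# and command. The exact set enabled on heplnx182 is assumed.
use_authentication=1
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
```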
19/09/2006
Started messing with nrpe to do remote monitoring
Installed nagios-nrpe-plugin on heplnx182.
Installed nagios-nrpe and nagios-plugins on heplnx10.
Opened TCP port 5666 on heplnx10 for nrpe service and edited /etc/nagios/nrpe.cfg to allow connections from heplnx182
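A minimal sketch of the nrpe.cfg change on heplnx10, assuming nrpe runs as a standalone daemon. The address shown for heplnx182 is a placeholder from the documentation range, not the real one, and the check_users thresholds are inferred from the performance data in the test output.

```cfg
# /etc/nagios/nrpe.cfg on heplnx10 (fragment)
server_port=5666
# 192.0.2.182 is a placeholder for the address of heplnx182
allowed_hosts=127.0.0.1,192.0.2.182
# Warn at 5 users, critical at 10 (assumed from the users=1;5;10;0 perfdata)
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
```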
On heplnx182 ran:
[root@heplnx182 nagios]# /usr/lib/nagios/plugins/check_nrpe -H heplnx10.pp.rl.ac.uk -c check_users
USERS OK - 1 users currently logged in |users=1;5;10;0
Looks good!
Edited /etc/nagios/minimal.cfg on heplnx182 to include a new command, host and services:
define command{
	command_name	check_system_disk
	command_line	$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_system_disk
	}
define host{
        use                     generic-host            ; Name of host template to use
        host_name               heplnx10
        alias                   heplnx10
        address                 130.246.43.10
        check_command           check-host-alive
        max_check_attempts      10
        check_period		24x7
        notification_interval   120
        notification_period     24x7
        notification_options    d,r
        contact_groups  admins
        }
define service{
        use                             generic-service         ; Name of service template to use
        host_name                       heplnx10
        service_description             PING
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
	notification_options		w,u,c,r
        notification_interval           960
        notification_period             24x7
	check_command			check_ping!100.0,20%!500.0,60%
        }
define service{
        use                             generic-service         ; Name of service template to use
        host_name                       heplnx10
        service_description             Root Partition
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
	notification_options		w,u,c,r
        notification_interval           960
        notification_period             24x7
	check_command			check_system_disk!20%!10%!/
        }
I also added heplnx10 to the test nodes hostgroup.
After reloading nagios, heplnx10 appeared on the web pages and the two services were checked.
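Note that the check_system_disk command as defined does not reference $ARG1$ to $ARG3$, so the 20%, 10% and / given in the service's check_command are not passed on to the remote host. The thresholds that take effect are whatever nrpe.cfg on heplnx10 defines for that command. A sketch of what such an entry might look like, with values assumed to mirror the service definition:

```cfg
# /etc/nagios/nrpe.cfg on heplnx10 (fragment)
# Assumed entry: warn below 20% free, critical below 10% free, on /
command[check_system_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```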
Then I created a new config file, user-interfaces.cfg, and added commands, a host (heplnx101), a host group for the user interfaces and some services for general things. I added that file to the main nagios.cfg and installed the two rpms on heplnx101.
Nagios started monitoring the services as expected.
I then created a new notification group for me, stopped the nrpe service on heplnx101 and waited for the checks to go critical. Distinct lack of emails. Will look at that later. (Ah, e-mail doesn't work with sendmail stopped!)
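A notification group of this kind would normally be a contact plus a contactgroup, roughly as sketched below in Nagios 2.x syntax. The contact name, group name and e-mail address are hypothetical, and the notification commands are assumed to be the ones from the sample config.

```cfg
# Hypothetical contact and contact group for test notifications
define contact{
        contact_name                    testadmin               ; hypothetical name
        alias                           Test Admin
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-by-email         ; assumed sample command
        host_notification_commands      host-notify-by-email    ; assumed sample command
        email                           someone@example.org     ; placeholder address
        }
define contactgroup{
        contactgroup_name       test-admins                     ; hypothetical name
        alias                   Test notification group
        members                 testadmin
        }
```

The notify-by-email commands hand the message to the local mailer, which would be consistent with no mail arriving while sendmail was stopped.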
20-26/09/2006
Started moving to a more permanent set up
Logical file structure
I've split the files up into each different type of definition, so I have:
commands.cfg
generic-templates.cfg
hostgroups.cfg
servicegroups.cfg
service-templates.cfg
time-periods.cfg
Then each host group has a directory containing a file with the host template and service definitions and a file with the host definitions.
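One way to wire such a layout into nagios.cfg is with cfg_file entries for the shared definitions and one cfg_dir entry per host group directory. This is only a sketch: the /etc/nagios location and the directory names are assumptions.

```cfg
# /etc/nagios/nagios.cfg (fragment)
# Shared definitions, one file per object type
cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/generic-templates.cfg
cfg_file=/etc/nagios/hostgroups.cfg
cfg_file=/etc/nagios/servicegroups.cfg
cfg_file=/etc/nagios/service-templates.cfg
cfg_file=/etc/nagios/time-periods.cfg
# One directory per host group; the directory names are hypothetical.
# cfg_dir reads every .cfg file found in the directory.
cfg_dir=/etc/nagios/grid-workers
cfg_dir=/etc/nagios/user-interfaces
```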
Hierarchy of templates
So for instance:
- generic-grid-worker-host inherits from
- generic-linux-host which in turn inherits from
- generic-host
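The chain above could be expressed roughly as follows. Only the template names come from this page; the directives placed at each level are assumptions chosen for illustration.

```cfg
; Base template: settings common to every host
define host{
        name                    generic-host
        notification_period     24x7
        notification_options    d,r
        register                0       ; template only, not a real host
        }
; Linux hosts: adds the host check (assumed placement)
define host{
        name                    generic-linux-host
        use                     generic-host
        check_command           check-host-alive
        max_check_attempts      10
        register                0
        }
; Grid workers: adds group membership and contacts (assumed placement)
define host{
        name                    generic-grid-worker-host
        use                     generic-linux-host
        hostgroups              7-GridWorkers
        contact_groups          admins
        register                0
        }
```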
Most of the host definition is contained in these templates, so the actual host definitions look like:
define host{
	use			generic-grid-worker-host
	host_name		heplnc001
	alias			heplnc001.pp.rl.ac.uk
	address			130.246.45.1
}
The same is also true of services, where I define a generic service template for each service (say system-load-service-template) that defines everything the service does apart from which nodes it applies to. Then the individual service definitions use the hostgroups to apply the services to nodes:
define service{
        use                             system-load-service-template
	hostgroups			7-GridWorkers
        }
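For context, the template behind that definition might look something like this sketch. The template name is from this page; the check command name and the other values are assumptions.

```cfg
; Everything about the service except where it runs
define service{
        name                    system-load-service-template
        use                     generic-service
        service_description     System Load
        check_command           check_system_load       ; hypothetical nrpe-based command
        normal_check_interval   5
        retry_check_interval    1
        register                0       ; template only
        }
```

With the template carrying all the settings, attaching the check to another class of node only needs another two-line service definition naming a different hostgroup.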
This even works quite well for individual service instances, like checking that a web page is accessible. I just define the service template in the normal way, then override the service_description and check_command like this:
define service{ 
        use                             http-url-service-template
        host_name                       heplnx182 
        service_description             ganglia web accessable 
        check_command                   check_http_url!ganglia.gridpp.rl.ac.uk!/
        } 
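The check_http_url command is not shown on this page. A plausible definition, assuming the first argument is the server name and the second the URL path, would be:

```cfg
; Assumed definition: $ARG1$ = server name, $ARG2$ = URL path to fetch
define command{
        command_name    check_http_url
        command_line    $USER1$/check_http -H $ARG1$ -u $ARG2$
        }
```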
I've now installed it on most of the nodes. We're now checking 482 services on 118 hosts and have started to tailor the services to the hosts.
Chris brew 18:52, 26 Sep 2006 (BST)
