Nagios 3.0 - Sample Check Program Integration for LDAP Statistics

Novell Cool Solutions: Feature
By Rainer Brunold

Posted: 1 Nov 2007
 

  • Open Enterprise Server
  • SUSE Linux Enterprise Server
  • SUSE Linux Enterprise Desktop
  • eDirectory

Chapter 3 – Integrating A New Check Program – Monitoring LDAP Statistics

Now that we have a default installation of Nagios and NagiosGraph, we can start to integrate new check programs.

Depending on the service you would like to monitor, the Nagios check programs already installed in /opt/nagios/libexec may meet your requirements. If they do not, the NagiosExchange project web site (www.nagiosexchange.org) holds many other open source check programs that are easy to integrate into your Nagios installation.

Over the last few months our company ran a server migration project in which we moved many NetWare servers to Linux and moved our complete eDirectory from NetWare to OES Linux. We have several web applications that use eDirectory for authentication and sometimes as a data store. In the beginning we noticed some performance problems and had no idea how to monitor those ldap queries. At that time I found a Cool Solutions article about ldap monitoring and learned that the ldap server provides statistics on how many queries were made and how many errors occurred. I took this information and put it together into a new ldap check program for Nagios, which I published over the last few weeks on Cool Tools as well as on the NagiosExchange project page. This article shows how to integrate a new check program into Nagios, using that program as the sample.

As a result of that integration you will be able to monitor your ldap server, be notified when the number of searches rises above a warning or critical value, and get graphs showing how many queries and errors there are. If you do not need the notification, ignore that part and just enjoy the new ldap graphs.

Screen shot:

This is a sample screen shot from our ZEN Linux Management server, with eDirectory in the background, showing how many ldap queries of each type were made and how many errors occurred.

What do we have to do?

First we have to download the new check program and put it into the Nagios plugin directory. Next we have to add a command definition for it, so Nagios knows how to start it and which parameters are required. To get graphs from NagiosGraph we have to add an entry to the map file; this entry contains a regular expression that matches the check program output. Finally we have to add a service definition for the new ldap service we would like to monitor. If we want mail notifications when the number of searches or errors is above a defined level, we also have to modify the default contact nagiosadmin.

1. Server Preparation

As described in the earlier Cool Solutions articles, we need a default Nagios and NagiosGraph installation on a system:

Here is the link to the Nagios default installation article: http://www.novell.com/coolsolutions/feature/19807.html

The article about the NagiosGraph default installation has not been published yet. Please find it later, linked from my author page: http://www.novell.com/coolsolutions/author/1525.html

The check program itself requires the openldap2-client package. Please check whether it is already installed; if it is not, use yast2 or the ZEN Linux Management (ZLM) client to install it.

The following command shows you how to check if it is already installed:

# rpm -q openldap2-client
If you need to install it you can use the following yast command:
# yast -i openldap2-client
or the ZLM client command:
# rug in openldap2-client

If you install that package using the graphical yast interface, take care not to install all openldap2 packages. Some of the other packages might conflict with other software on your system!

If you would like to activate email notification for the service, the postfix package has to be installed, because Nagios delivers the mail to the local postfix process, which is responsible for forwarding it to the next smtp server.

2. Software Download and Extraction

There is only one single bash script file required for this ldap monitoring:

Software: check_edir_ldap_stats.sh

Download Link: http://www.nagiosexchange.org/Novell.113.0.html?&tx_netnagext_pi1[p_view]=1118

Download that script and copy it to the default Nagios check program directory:

# cp <check_edir_ldap_stats.sh> /opt/nagios/libexec
# chmod +x /opt/nagios/libexec/check_edir_ldap_stats.sh

3. A New Command Definition

Every check program that is used in your Nagios installation has to be configured in the command configuration file (commands.cfg). This definition allows Nagios to start the check program with all required parameters. So please add this definition to your commands.cfg file.

# vi /opt/nagios/etc/objects/commands.cfg
...
define  command {
        command_name    check_edir_ldap_stats
        command_line    $USER1$/check_edir_ldap_stats.sh -H $HOSTADDRESS$ -P $ARG1$ -T $ARG2$ -w $ARG3$ -c $ARG4$
        }
...

So what does this command definition mean in detail?

First the command_name will be used in the service definition to refer to this command. The command_line contains several parameters that are added dynamically to the command by Nagios when it is executed.

"$USER1$" is defined in /opt/nagios/etc/resource.cfg and holds the path to the check programs. In our installation this is /opt/nagios/libexec. As I described in the NagiosGraph article, resource.cfg is used to store sensitive data for your Nagios installation. It is currently the only file that is readable solely by the nagios user; most of the others are readable by all users. So if you need to store sensitive user names and passwords somewhere, e.g. for ldap binds or ftp server connections, store them in that file.

The command definition refers to entries from that file by using $USER1$ to $USER32$.

The command_line then continues with the name of the executable itself, regardless of whether that is a simple bash script, a perl program, a C++ binary or anything else. Any programming language that runs on that linux system can be used to write a check command.

The parameters follow after the executable name. They vary from one check command to another. A good way to see which parameters a check program requires is to start it in a shell with the option --help or -h, which normally shows a description. You can do that for this one as well. Here is a short summary of the parameters of check_edir_ldap_stats.sh:

-H host name or ip address that needs to be checked
-P port of the ldap server (389 for ldap, 636 for ldaps or any other port)
-T ldap type (ldap or ldaps)
-w number of ldap queries or errors per second that should raise a warning message
-c number of ldap queries or errors per second that should raise a critical message
-u / -p these parameters can be used if anonymous bind to your ldap server is disabled and you need to provide a user and password for the ldap bind. We do not use them here; I assume your server allows anonymous bind.

The parameters can also contain several types of so-called Nagios macros, as well as arguments that are defined in the service definition later in this article. The idea behind this is that a single command definition should be flexible enough to be used in as many service definitions as possible. Imagine you have 3 ldap servers to monitor. You need each server's ip address in the command or service definition. Without Nagios macros that would require three command definitions or three service definitions. As the ip address of each monitored server is already defined in its host definition, Nagios can refer to that entry using the macro $HOSTADDRESS$. When you use that macro in the command definition, Nagios automatically inserts the ip address of the server that has to be checked.

The next parameters, $ARG1$ to $ARG4$, are defined in the service definition later in this article and are inserted into the command before it is actually executed.

4. A New Host Definition

Our default Nagios installation currently has only the localhost configured to be monitored.

For this reason we have to add another host definition for the eDirectory server we would like to monitor.

The default Nagios configuration works well when only a few hosts and services are monitored, but once that number grows, a small modification will make your life much easier. I suggest the following change to give your configuration a better structure.

The Nagios main configuration file /opt/nagios/etc/nagios.cfg refers to the other configuration files using the parameter cfg_file. Right now, line 38 contains the include for localhost.cfg. As we need to add another host to be monitored, we can either add another cfg_file line pointing to the new configuration file, or change that entry from cfg_file to cfg_dir and provide a directory that holds all host definitions, including the one for the localhost. In that case Nagios reads all .cfg files from that cfg_dir when it is started or reloaded, and you do not have to change the main configuration file whenever you add a host. Just put the new file into that directory and reload Nagios, and the new host will be added.

Note: This works not only for host definitions but for any other configuration files as well. In our configuration we have dedicated directories for the host and the service configurations. We currently monitor about 500 hosts and about 9000 services, and the cfg_dir directive has helped a lot!

# mkdir -p /opt/nagios/etc/objects/hosts
# mv /opt/nagios/etc/objects/localhost.cfg /opt/nagios/etc/objects/hosts

Next edit the Nagios main configuration file and deactivate the cfg_file line for the localhost.cfg and add the cfg_dir pointing to the new directory:

# vi /opt/nagios/etc/nagios.cfg
...
#cfg_file=/opt/nagios/etc/objects/localhost.cfg               remark this line
cfg_dir=/opt/nagios/etc/objects/hosts
...

Now we have to add the host definition for the eDirectory server you would like to monitor to that hosts directory. A host definition can contain about 43 different parameters, but only 9 are mandatory. I describe only those 9, plus the ones required to activate the notification if something goes wrong.

You can name the following host definition file after your eDirectory server. I call it LX-TZLM09.cfg, as this is the dns name of that host.

# vi /opt/nagios/etc/objects/hosts/LX-TZLM09.cfg
define host{
        host_name               LX-TZLM09
        alias                   LX-TZLM09
        address                 10.10.10.1
        check_command           check-host-alive
        check_interval          5
        max_check_attempts      3
        retry_interval          1
        active_checks_enabled   1
        passive_checks_enabled  1
        check_period            24x7
        contacts                nagiosadmin
        notification_interval   60
        notification_period     24x7
        notification_options    d,u,r
        notifications_enabled   1
        }

The Nagios installation on your system also includes very good, detailed documentation that describes all possible object configuration values. You will find the documentation page for the host definition by following this link on your own Nagios server: http://<your Nagios hostname>/nagios/docs/objectdefinitions.html#host

Please check the link above for detailed information about each configuration option. I will describe just a few of them here:

address This can be the ip address of the server or its dns name. But take care: when the dns server is down, a host configured by name cannot be checked, because no name resolution is available. We always use the ip address here.
The value of this parameter is referenced in the command configuration file as the macro $HOSTADDRESS$. When you assign the ldap service to this server, Nagios puts this value into the command line of the check program before it is executed.
check_command,
check_interval,
max_check_attempts,
retry_interval,
check_period
Nagios knows two different check types, host checks and service checks. Host checks test whether the host is available, much like a ping does, while service checks test a dedicated service. The values here control the host checks, as they are in the host definition.
Nagios repeats the check_command (see the definition in commands.cfg) for this host every 5 minutes (check_interval) during the whole day (24x7, check_period). If the check_command does not report the host as up, Nagios rechecks it every minute (retry_interval) until the check has been attempted 3 times (max_check_attempts). If the host is still not reported as up, it is shown as down in the web front end, and it is then checked every 5 minutes (check_interval) again.
All time values are set in minutes.
active_checks_enabled This defines whether the host checks are executed.
passive_checks_enabled This allows passive checks for this host, which are required in a distributed Nagios monitoring installation. That is helpful when you have many systems to monitor and want to distribute the check tasks across different servers. One of the following articles will cover the NSCA service for distributed monitoring.
contacts Contacts are used for sending notifications if something goes wrong with this host, and also for restricting access to hosts and services in the web front end. If you have several contacts defined on your Nagios system and assign them to different hosts and services, each contact can see only the hosts and services it is assigned to in the web front end. Right now we have only nagiosadmin configured and have assigned it to all hosts and services. Because of that, and because cgi.cfg contains some lines about special user permissions, the nagiosadmin user can do everything in the web front end.
Instead of contacts you can also use contact_groups here to build groups of contacts. This is helpful when you have several administrators.
notification_interval,
notification_period,
notification_options,
notifications_enabled
As notifications are enabled for this host (notifications_enabled), Nagios will send you a notification whenever the host state changes to down (d) or unreachable (u), or when it recovers to the up state (r), as configured in notification_options. If the host stays in the down or unreachable state, Nagios repeats the notification every 60 minutes (notification_interval) during the whole day (24x7, notification_period).

If you monitor many hosts with your Nagios installation, it is helpful to define some host templates (they contain register 0 in their definition) and then assign those templates to the host definitions (templates are referenced with the use directive). This makes the host definitions much smaller, and you have just one definition to change if you would like to modify all of them. We have dedicated host templates for netware, windows and linux servers in our environment. Please refer to the Nagios documentation for using templates.

5. A New Service Definition

The next step is to define the ldap service and assign it to the newly created host. A service definition can contain about 30 different parameters, most of them similar to those of the host definition. I define here just the most important ones we require for service monitoring, graph generation and notification.

# vi /opt/nagios/etc/objects/hosts/LX-TZLM09.cfg
...
define service{
        host_name               LX-TZLM09
        service_description     LDAP-Stats
        check_command           check_edir_ldap_stats!636!ldaps!10!20
        max_check_attempts      1
        check_interval          5
        retry_interval          5
        check_period            24x7
        active_checks_enabled   1
        passive_checks_enabled  1
        notifications_enabled   1
        notification_interval   30
        notification_period     24x7
        notification_options    w,c,r
        contacts                nagiosadmin
        }
...

A full description of all parameters can be found again on your local Nagios server following this link: http://<your Nagios hostname>/nagios/docs/objectdefinitions.html#service

I describe just a few parameters here; the rest are explained with the host definition.
host_name This defines the host to which we attach this service. If you use host groups you can use hostgroup_name instead to assign this service to a group of hosts. Imagine creating a group called ldap_server, assigning all eDirectory servers to that group and adding the service to that group. Using groups can reduce the configuration work in Nagios considerably! Use host groups as much as possible!
service_description This is a short name that is listed in the Nagios front end.
check_command This is the link to the command we defined before. Here you can configure additional parameters, separated by "!", that are passed to the check program and referenced in the command definition as $ARG1$, $ARG2$, and so on.

Putting the command definition and the service definition together results in the following command line, which Nagios executes whenever the check interval of this service is reached:

/opt/nagios/libexec/check_edir_ldap_stats.sh -H 10.10.10.1 -P 636 -T ldaps \
                                             -w 10 -c 20
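
The way this command line is assembled can be sketched in a few lines of bash. This is only an illustration of how the "!"-separated values map to $ARG1$ through $ARG4$; Nagios performs the substitution internally, and the helper name expand_check_command is made up for this example. The command_line template from section 3 is hard-coded, with /opt/nagios/libexec standing in for $USER1$.

```shell
#!/usr/bin/env bash
# Illustration only: mimic how Nagios fills $HOSTADDRESS$ and $ARG1$..$ARG4$
# into the command_line of check_edir_ldap_stats.
# expand_check_command is a hypothetical helper, not part of Nagios.
expand_check_command() {
    local address="$1" check_command="$2"
    local name arg1 arg2 arg3 arg4
    # Split "name!arg1!arg2!arg3!arg4" at the "!" separators.
    IFS='!' read -r name arg1 arg2 arg3 arg4 <<< "$check_command"
    # Hard-coded copy of the command_line from commands.cfg.
    printf '/opt/nagios/libexec/check_edir_ldap_stats.sh -H %s -P %s -T %s -w %s -c %s\n' \
        "$address" "$arg1" "$arg2" "$arg3" "$arg4"
}

expand_check_command 10.10.10.1 'check_edir_ldap_stats!636!ldaps!10!20'
```

Changing only the address or the values after the "!" gives the command line for another server or other limits, which is why one command definition can serve many services.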

Attention: You can now run this command in a remote session on your server to see what result you get. Because the command writes some temporary files, the user you are currently working as will own them. I assume that you are working as root right now. Later, when we activate the changes in Nagios, the check program is executed as the nagios user, which will have no write permission on those temporary files. So if you run the command manually, please remove the /tmp/ldap_* files on your Nagios server afterwards. They will then be created by the nagios user on the next run and everything will be fine.

Running the command the first time will just write the current ldap values to the temporary file:

Script started the first time, writing just the history file /tmp/ldap_history_LX-TZLM09.tmp

Waiting a few seconds and rerunning the command will show you the number of the different ldap searches and errors per second since the last run:

LDAPSTATS OK: wholeSubtreeSearchOps: 4 oneLevelSearchOps: 7 searchOps: 2 errors: 0 securityErrors: 0 - warn: 10 crit: 20

The check command itself performs an ldap search of the category searchOps. So if your ldap server is not very busy and has nothing to do, you will see just the searchOps request made by this script. The script queries the ldap statistics from the server and writes them together with the current time stamp to the temporary files. The next time it is started it queries the ldap statistics again, calculates the differences between the values and divides them by the number of seconds since the last query. Everything after the decimal point is cut off; there is no rounding. That way we get the average number of searches and errors per second.

When Nagios runs the service check, the definition has a check_interval of 5 minutes = 300 seconds. So if you have fewer than 300 ldap searches on your server during that time, you will never get a value greater than 0. Such low numbers of ldap searches per second are not very interesting anyway. We use this to monitor ldap servers with more than 50 queries per second, averaged over 5 minutes!
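
The calculation described above comes down to an integer division. The following is a minimal sketch of that arithmetic, not the code of check_edir_ldap_stats.sh itself; the function name rate_per_second and the counter values are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the per-second average (not the original script).
# Arguments: previous counter, current counter, previous time stamp and
# current time stamp in seconds. Shell arithmetic is integer only, so the
# fractional part is cut off without rounding.
rate_per_second() {
    local prev="$1" cur="$2" prev_ts="$3" cur_ts="$4"
    local elapsed=$(( cur_ts - prev_ts ))
    if [ "$elapsed" -le 0 ]; then
        echo "no time elapsed between the two samples" >&2
        return 1
    fi
    echo $(( (cur - prev) / elapsed ))
}

rate_per_second 1000 1299 0 300     # 299 searches in 300 seconds: prints 0
rate_per_second 1000 16000 0 300    # 15000 searches in 300 seconds: prints 50
```

The first call shows why a lightly used server always reports 0: the 299 searches spread over 300 seconds give less than one search per second, and the fraction is dropped.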

NOTE: As the check program is open source, you can modify it to match exactly what you need. If you want to monitor smaller numbers of ldap queries, change the code and use the modified script. Please publish your modifications; others may thank you for that.

NOTE: Setting the warning and critical values of the check program to 10 and 20 through the third and fourth argument checks all 5 ldap values against those limits. If any one of them is above a limit, the service will go to warning or critical. You can modify the values to meet your requirements. If the check program does not do exactly what you need, it is just a bash script that you can modify and adapt as required.
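
How the two limits apply to all five values can be sketched as follows. This is an illustration under assumptions rather than the script's actual code: the function name ldap_state is hypothetical, and the comparison is taken to be strictly "greater than", which the article does not spell out. The return codes 0, 1 and 2 are the usual Nagios plugin exit codes for OK, WARNING and CRITICAL.

```shell
#!/usr/bin/env bash
# Sketch: map the five per-second values to a Nagios state.
# Usage: ldap_state WARN CRIT VALUE...
# Assumption: a value must be strictly greater than a limit to trigger it.
ldap_state() {
    local warn="$1" crit="$2" state=0 value
    shift 2
    for value in "$@"; do
        if [ "$value" -gt "$crit" ]; then
            state=2                          # critical overrides everything
        elif [ "$value" -gt "$warn" ] && [ "$state" -lt 1 ]; then
            state=1                          # warning unless already critical
        fi
    done
    case "$state" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
    esac
    return "$state"
}

ldap_state 10 20 4 7 2 0 0    || echo "exit code $?"   # prints OK
ldap_state 10 20 28 7 106 0 0 || echo "exit code $?"   # prints CRITICAL, then exit code 2
```

The two calls use the values from the sample outputs in this article, with the limits 10 and 20 from the service definition.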

6. NagiosGraph Configuration

Now that we know what the output of the check program looks like, we can add an entry to the NagiosGraph regular expression file.

Sample output of the check program:

LDAPSTATS OK: wholeSubtreeSearchOps: 4 oneLevelSearchOps: 7 searchOps: 2 errors: 0 securityErrors: 0 - warn: 10 crit: 20

Entry we add to the /opt/nagiosgraph/map file at the end:

# vi /opt/nagiosgraph/map
...
# LDAPSTATS OK: wholeSubtreeSearchOps: 4 oneLevelSearchOps: 7 searchOps: 2 errors: 0 securityErrors: 0 - warn: 10 crit: 20
/output:LDAPSTATS .* wholeSubtreeSearchOps: (\d+) oneLevelSearchOps: (\d+) searchOps: (\d+) errors: (\d+) securityErrors: (\d+).*/
and push @s, [ ldapstats,
               [ subtree,    GAUGE, $1 ],
               [ onelevel,   GAUGE, $2 ],
               [ search,     GAUGE, $3 ],
               [ errors,     GAUGE, $4 ],
               [ secerrors,  GAUGE, $5 ]
             ];

Please take care that there is no empty line at the end of the map file.

This definition in the NagiosGraph configuration file should match the service check output and will generate the round robin database we need for creating the graphs in the next section.
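
Before relying on the map entry, it can help to confirm that the regular expression really matches the plugin output. The sketch below does that with bash's built-in regex matching. It is not part of NagiosGraph, the helper name parse_ldapstats is made up, and because bash uses POSIX extended regular expressions, [0-9]+ stands in for Perl's \d+. The leading "output:" of the map pattern is left out, since only the raw plugin output is tested here.

```shell
#!/usr/bin/env bash
# Sketch: test the map file pattern against a plugin output line.
# parse_ldapstats is a hypothetical helper, not part of NagiosGraph.
parse_ldapstats() {
    local output="$1"
    # Same pattern as in the map entry, with \d+ written as [0-9]+ for bash.
    local pattern='LDAPSTATS .* wholeSubtreeSearchOps: ([0-9]+) oneLevelSearchOps: ([0-9]+) searchOps: ([0-9]+) errors: ([0-9]+) securityErrors: ([0-9]+)'
    if [[ $output =~ $pattern ]]; then
        # BASH_REMATCH[1] to [5] correspond to $1 to $5 in the map entry.
        echo "subtree=${BASH_REMATCH[1]} onelevel=${BASH_REMATCH[2]} search=${BASH_REMATCH[3]} errors=${BASH_REMATCH[4]} secerrors=${BASH_REMATCH[5]}"
    else
        echo "pattern does not match" >&2
        return 1
    fi
}

parse_ldapstats 'LDAPSTATS OK: wholeSubtreeSearchOps: 4 oneLevelSearchOps: 7 searchOps: 2 errors: 0 securityErrors: 0 - warn: 10 crit: 20'
```

If the check program output ever changes, for example after you modify the script, rerunning such a test shows quickly whether the map entry still fits.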

I do not want to go into too much detail here, because the last Cool Solutions article about NagiosGraph should provide enough information. Just note that you do not need to put all values from the service output into the database. You can see that the regular expression stops after "securityErrors", so the "warn" and "crit" values are not written to the database. If you would like to see them in the graph, extend the regular expression and the database definition.

7. Insert Graph Definition into the Service Definition

Coming to the end of the configuration work, we have to add the link to NagiosGraph to the service definition we made earlier. We use the notes_url in the service definition to point to the graphs.

# vi /opt/nagios/etc/objects/hosts/LX-TZLM09.cfg
...
define service{
        host_name               LX-TZLM09
...
        contacts                nagiosadmin
        notes_url     /nagiosgraph/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&db=ldapstats,subtree,onelevel,search&db=ldapstats,errors,secerrors&geom=500x100&rrdopts=%2Dl%200%20%2Dt%20LDAP%2DStatistics
        }
...
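
The rrdopts value in the notes_url is URL-encoded, which makes it hard to read. A small bash sketch can decode it; urldecode is just an illustrative helper name, and the function relies on bash's printf interpreting \xHH escapes with %b.

```shell
#!/usr/bin/env bash
# Sketch: decode %XX escapes so the rrdopts parameter becomes readable.
# urldecode is a hypothetical helper for illustration only.
urldecode() {
    # Turn every "%XX" into "\xXX" and let printf %b interpret the escapes.
    printf '%b\n' "${1//%/\\x}"
}

urldecode '%2Dl%200%20%2Dt%20LDAP%2DStatistics'
```

Decoded, the value reads -l 0 -t LDAP-Statistics. In rrdtool graph terms, -l sets the lower limit of the y-axis (here 0) and -t sets the graph title.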

NOTE: In the second article, about installing NagiosGraph, I created another configuration block with the tag "serviceextinfo" and put the notes_url there. I have since noticed that Nagios version 3 changed some things in the configuration files, and you can now put the notes_url into the service definition itself. That makes it a little more readable and easier to administer. Nagios is backward compatible, however, so you can still use the old style. It is always a good idea to check the changelog at www.nagios.org to see what has changed in new versions.

8. Configure the Notifications

The notification is done by a script that is defined in commands.cfg, so whatever notification method you have available on your linux system (sms, email, ...) can be integrated into Nagios. We use the easiest way to notify the administrator here: we send an email.

First we need a contact who will get those emails. The Nagios default configuration already has nagiosadmin configured, so we will use that contact for sending the emails. Let's check its contact definition and see what is configured.

# cat /opt/nagios/etc/objects/contacts.cfg
...
define contact{
        contact_name                    nagiosadmin
        use                             generic-contact
        alias                           Nagios Admin
        email                           nagios@localhost
        }
...

There is not much information in it except the email address, which we will change to match your own email address, and the reference to a template called "generic-contact". So let's look at that template and see what is configured there:

# cat /opt/nagios/etc/objects/contacts.cfg
...
define contact{
        name                            generic-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        register                        0
        }
...

That is nearly what we need. The user nagiosadmin inherits all options from the template. He will receive host and service notifications around the clock (host_notification_period and service_notification_period are 24x7). He will receive emails for all available host and service status changes (host_notification_options and service_notification_options). And he will receive the host and service notifications via email, using the two notify-by-email commands.

The notification options, which define the kinds of status changes the user is informed about, are set to all possible types, so the user can receive notifications of every type. But as the service definition for the ldap service is configured to send notifications only on warning, critical and recovery, he will get only those emails.

You have two choices here. You can configure the service to send a notification on any change and then filter at the user level, so the user gets just warnings and criticals. Or you can let the user receive all types and restrict at the service level which events cause a notification. I think defining this at the service level is the better way, because later on you might monitor services for which you do not want any notification.

How the notification is sent can be configured separately for each user. In this sample the user is informed by email about any host and service changes. If you would like to use a different method, add your notification script to the Nagios command definitions and assign the new command name in the user definition.

So the existing user definition is okay for us and we just have to modify the email address to match your address:

# vi /opt/nagios/etc/objects/contacts.cfg
...
define contact{
...
        email                           firstname.lastname@domain.com
        }
...

When we look at the notification command that is configured for the user, we see that many Nagios internal macros are used to generate the email. The email itself is handed to the local postfix on your server, which has to be configured to forward emails to a responsible smtp server.

# cat /opt/nagios/etc/objects/commands.cfg
...
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
        }
...

To enable your local postfix server to forward emails to another smtp server, you only have to change one setting, activate it and restart the postfix process.

# vi /etc/sysconfig/postfix
...
POSTFIX_RELAYHOST="<smtp mail relay server>"
...

# SuSEconfig
# rcpostfix restart

9. Activate the Nagios Configurations

All configurations are done now and we have to activate them by restarting Nagios.

This is necessary because we made changes to the Nagios main configuration file nagios.cfg.

# /etc/init.d/nagios restart
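
Before issuing the restart, it can be worth letting Nagios verify the configuration, because a typo in one of the new definitions would otherwise stop the daemon from coming back up. The nagios binary accepts -v followed by the main configuration file for this purpose. The sketch below wraps that check; the function name verify_then_restart is made up, and the binary path /opt/nagios/bin/nagios in the commented usage line is an assumption about this installation.

```shell
#!/usr/bin/env bash
# Sketch: run the Nagios configuration check and restart only if it passes.
# Usage: verify_then_restart NAGIOS_BINARY MAIN_CONFIG RESTART_COMMAND...
# verify_then_restart is a hypothetical helper, not shipped with Nagios.
verify_then_restart() {
    local nagios_bin="$1" cfg="$2"
    shift 2
    if "$nagios_bin" -v "$cfg"; then
        "$@"                                 # configuration is valid: restart
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}

# Assumed paths for this installation; adjust them if yours differ:
# verify_then_restart /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg /etc/init.d/nagios restart
```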

10. Test the new LDAP Service

After the restart Nagios starts monitoring the new ldap service, NagiosGraph processes the service output, and we should be able to see the service in the web front end along with the link to NagiosGraph.

In my environment that ldap service reports that the ldap service is in state critical because I set the critical level to 20 and there are already 106 searches per second as an average over the last 5 minutes. So I assume that there must be an email notification about that state in my email box. And here it is:

subject:   ** PROBLEM Service Alert: LX-TZLM09/LDAP-Stats is CRITICAL **

***** Nagios *****

Notification Type: PROBLEM

Service: LDAP-Stats
Host: LX-TZLM09
Address: 10.10.10.1
State: CRITICAL

Date/Time: Sat Oct 20 16:39:29 CEST 2007

Additional Info:

LDAPSTATS CRITICAL: wholeSubtreeSearchOps: 28 oneLevelSearchOps: 7 searchOps: 106 errors: 0 securityErrors: 0 - warn: 10 crit: 20

After waiting a few hours, the graphs will show the trends as well.

11. Additional Hints

You might get email notifications not only from the ldap service but from all other services as well. This is because notifications are enabled globally in the main Nagios configuration file nagios.cfg (enable_notifications) and are active for every host and service unless they are switched off in its definition. You can disable notifications for each host and service separately by adding "notifications_enabled 0" to its definition. After that, reload the Nagios configuration and you will get just the remaining notifications.

If your ldap service is always in the OK state, you can lower the warning and critical values to provoke a notification, or you can go to the service details in the web front end and submit a passive check result for that service with the status WARNING or CRITICAL. That triggers the same notification as if the ldap script had run and reported such a state. After you submit the passive check result, the service stays at that level until the next check_interval is reached and then returns to the state the check program reports.

If you are not sure whether the mail is delivered correctly, take a look at /var/log/mail, the log file of the postfix daemon.

Rainer Brunold

