
Nagios: Host and Service Monitoring Tool

Novell Cool Solutions: Feature
By Rainer Brunold


Posted: 19 Jan 2006
 

Nagios:

Host and Service Monitoring Tool - quick overview and basic installation guide

What is Nagios ?


This is a sample screen from our installation / service problems view.

Nagios (http://www.nagios.org) is an open source host, service and network monitoring tool. It lets you monitor many types of services and hosts running on different operating systems such as Linux, NetWare, Windows and AIX. It is flexible to configure and can be extended as far as you like. It is configured in text files and managed with a web browser.

A basic installation includes a set of Nagios check programs, so you can be monitoring your first hosts and services less than an hour after you start the installation.

Based on that default configuration you can start extending the configuration to your special needs.

You can use the existing check programs or add more from http://www.nagiosexchange.org, where other developers offer many of their check programs for download. You can also write your own in any programming language that is available on Linux. For each service you can set up time frames for monitoring as well as notifications when alarms arise. You can, but you do not have to. It is always your choice, and that is what is great about Nagios. If you do not want notifications, skip that part of the configuration; you can always watch the service problems in the web browser. That is what I currently do. It is good to come to the office in the morning, take one look at the Nagios service problems view and know what is going on. And when I do an eDirectory migration in our tree of 260+ NetWare servers, Nagios is open in the background so I can see which servers are not reachable.

Here is a list of check programs in the base system:

check_breeze check_http check_nt check_ssh

check_by_ssh check_icmp check_ntp check_swap

check_dhcp check_ifoperstatus check_nwstat check_tcp

check_dig check_ifstatus check_oracle check_time

check_disk check_imap check_overcr check_udp

check_disk_smb check_ircd check_ping check_udp2

check_dns check_ldaps check_pop check_ups

check_dummy check_load check_procs check_users

check_file_age check_log check_radius check_wave

check_flexlm check_mailq check_real negate

check_fping check_mrtg check_rpc urlize

check_ftp check_mrtgtraf check_sensors utils.pm

check_hpjd check_nagios check_smtp utils.sh

check_nntp check_snmp

At http://www.nagiosexchange.org you'll find a lot more if you are missing something here.

Here are some things we wanted to solve with Nagios:

We are running a lot of different management solutions (ZEN for server, IBM Tivoli, HP / Compaq System Insight Manager, McAfee ePolicy Orchestrator, ...) and each of them is strong and powerful in some areas. The problem is that some of them are really time-consuming to install, configure and administer. It is difficult to add your own services and hard to get a single view of all current services and their existing problems.

So I went looking for a solution that is quite easy to deploy but powerful enough to handle all our requirements. Of course I will try to replace some of the existing management solutions in our company with it, but I am sure it cannot replace them all. A combination will work best for us.

Here are some examples (besides the many default monitoring requirements like CPU, filesystem, ...) that I had to deal with in day-to-day administration and that I wanted to automate with Nagios:

  • check Bordermanager SurfControl logfiles to see if the update happened

  • check ftp/tls and ftp/ssh server functionality with a real ftp user connect

  • check time synchronization on oes/linux server

  • check running virus scanner processes (like LinuxShield from McAfee)

  • check HP Proliant server status from the HP Proliant support pack agents

  • check DNS round robin functionality

  • check Vmware remote connection and web interface ports

  • check ZEN linux management mirror process for ZLM and RCD targets

  • check memory / swap usage on linux servers

  • check linux drbd synchronization status

  • check server availability on one screen during e.g. eDirectory migrations

  • check ldap functionality

  • check web applications

  • check oes/linux cluster resources

  • ....

So how does it work ?

Put simply, you define a host by its name, its IP address and a few other parameters, and assign some services to it. A service is nothing more than a configured check program that runs some tests at a predefined interval. For example, the ftp service uses the check_ftp program to make ftp connections to the server. The program reports its result back to Nagios with an exit code: exit code 0 means the test was okay (green), exit code 1 means warning (yellow), exit code 2 means critical (red) and exit code 3 means unknown (orange).
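The exit code convention can be sketched as a small shell helper. This is only an illustration and not part of Nagios; the function names nagios_status_name and run_check are made up for this example.

```shell
# Hypothetical helper (not shipped with Nagios): translate a check
# program's exit code into the status name shown in the web interface.
nagios_status_name() {
    case "$1" in
        0) echo "OK" ;;        # green
        1) echo "WARNING" ;;   # yellow
        2) echo "CRITICAL" ;;  # red
        *) echo "UNKNOWN" ;;   # orange (check programs use exit code 3)
    esac
}

# Run any check program with its arguments, then print the status name
# that corresponds to its exit code.
run_check() {
    rc=0
    "$@" || rc=$?
    nagios_status_name "$rc"
}
```

With the plugins installed as described later in this article, a call could look like run_check /usr/local/nagios/libexec/check_ftp -H <your ftp server>, which prints the plugin output followed by the status name.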

In the service problems view, Nagios shows you the problems that currently exist (hopefully this list is short or even empty); in the service details view it lists all configured services and their status.

Depending on your notification configuration, it also notifies you with the relevant information.

If you are interested in such a system monitoring tool, it is well worth your time to do a test installation and spend a few hours with it. We are currently monitoring 308 hosts and about 1058 services.

Additional configurations / extensions:

There are a lot of configurations / extensions that are not covered in this document. As you discover more of the possible configurations you will find things such as how to put a web application link behind a host's extra notes. For example, if a host is down, just select its extra host notes and you are forwarded directly to e.g. the HP Remote Insight board. Or if a server shows warnings from the HP Proliant agents, you can select the additional service tasks and go directly to the Proliant web management interface on that server. There is no need for us to enter the URL for that web page anymore, or even to look at the HP System Insight Manager for server status. That's all in Nagios now.

I'll try to show you how you can set up a Nagios installation with a little Linux knowledge in less than an hour or two. I think that's the best way to take a look at it. After that I'll show how to add a new check program for Nagios that monitors whether a single file exists.

So here is a basic installation guide:

Needless to say, you should do this on a test server first. If you use e.g. PuTTY to connect to your server via ssh, you can copy and paste the commands from this document directly into the shell, so you do not have to type them yourself. The leading # on command lines is the root shell prompt, so do not copy it along with the command.

  1. Do a default OES/Linux or SLES 9 - 32 bit installation.

    Nagios does not need much memory (less than 32 MB) or disk space (less than 50 MB), so it can run on a "small" server or even a virtual machine.

  2. Install / remove some packages of your installation

    OES and SLES contain Nagios version 1.2 in their distributions, but I decided to use the current version 2.0 rc1 from
    http://www.nagios.org. So I had to remove and install a few packages with yast:

    remove if installed:
    nagios, nagios-nsca and nagios-plugins

    install if not yet done:
    gd-devel and libpng-devel packages
    and the whole
    Simple Webserver selection

  3. Download the Nagios and the plugin tarball

    Download the most recent Nagios tarball and the official plugin tarball for the current Nagios version from
    http://www.nagios.org/download and copy them to /tmp on your test server.

    Right now Nagios version 2.0rc1 is available. Normally I prefer to use only rpms for installation, but this time I use the tarball so I can configure some additional parts during compilation and installation. I tried this installation procedure with other 2.0x versions and it worked well.

    current nagios tarball:
    nagios-2.0rc1.tar.gz
    current plugin tarball:
    nagios-plugins-1.4.1.tar.gz

  4. Create the Nagios user and group

    You're probably going to want to run Nagios under a normal user account, so add a new user and group to your system with the following commands:

    # useradd -m nagios
    # groupadd nagios


  5. Create the installation directory

    Create the base directory where you would like to install Nagios as follows...

    # mkdir /usr/local/nagios

    Change the owner of the base installation directory to be the Nagios user and group you added earlier as follows:

    # chown nagios.nagios /usr/local/nagios

  6. Identify the web server user

    You're probably going to want to issue external commands (like acknowledgements and scheduled downtime) from the web interface. To do so, you need to identify the user your web server runs as (typically wwwrun). This setting is found in your web server configuration file. The following command can be used to determine quickly what user Apache is running as:

    # grep -R "^User" /etc/apache2/*

    Normally the user is wwwrun. We will add it to the Nagios command group (nagcmd) in the next step. If the user differs, be sure to use the right one in step 7.
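If you want only the account name instead of the matching config lines, the grep above can be wrapped as follows. This is a sketch; the function name apache_run_user is invented for this example, and the default path is the one used in this article.

```shell
# Hypothetical helper: print the account named by the "User" directive
# found in an Apache config tree. Pass another directory as $1 if your
# config lives elsewhere; /etc/apache2 is the path used in this article.
apache_run_user() {
    # Match "User <name>" only, not directives such as UserDir.
    grep -Rh '^User[[:space:]]' "${1:-/etc/apache2}" | awk '{ print $2; exit }'
}
```

On a default installation, calling apache_run_user without arguments should print the same user name that the grep command shows.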

  7. Add a command file group

    Next we're going to create a new group whose members include the user your web server is running as and the user Nagios is running as. Let's say we call this new group 'nagcmd':

    # groupadd nagcmd

    Next, add the users that your web server and Nagios run as to the newly created group with the following commands:

    # usermod -G nagcmd wwwrun
    # usermod -G nagcmd nagios
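To confirm that both accounts really ended up in the new group, a check along these lines may help. It is a sketch only; in_group is a name chosen for this example and relies on the standard id command.

```shell
# Hypothetical helper: succeed if user $1 belongs to group $2,
# either as primary or as supplementary group.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx "$2"
}
```

For the setup in this step you would expect both in_group wwwrun nagcmd and in_group nagios nagcmd to succeed. Processes that are already running keep their old group list, so the web server only picks up the change once it is restarted.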


  8. Extract the Nagios tarball

    # cd /tmp
    # tar -xvzf nagios-2.0rc1.tar.gz
    # cd nagios-2.0rc1


  9. Compile the Nagios package

    Run the configure script to initialize variables and create a Makefile as follows:

    # ./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios --with-nagios-user=nagios --with-nagios-group=nagios --with-command-group=nagcmd

  10. Compile the binaries

    Compile Nagios and the CGIs with the following command:

    # make all

  11. Installing the binaries and HTML files

    Install the binaries and HTML files (documentation and main web page) with the following command:

    # make install

  12. Installing an init script

    If you want, you can also install the init script /etc/init.d/nagios with the following command:

    # make install-init

  13. Installing command mode

    If you want, you can also install the command mode environment with the following command:

    # make install-commandmode

  14. Installing sample config files

    Now we install some default configuration files, which have to be changed slightly later on:

    # make install-config
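If you repeat the installation on several test servers, steps 9 to 14 can be chained so that the build stops at the first failing command. The following is a sketch under that assumption; the DRY_RUN variable and the function names run and build_nagios are inventions of this example, while the configure options and make targets are the ones from the steps above.

```shell
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Chain steps 9 to 14; "&&" stops at the first command that fails.
# Call this from inside the extracted nagios-2.0rc1 directory.
build_nagios() {
    run ./configure --prefix=/usr/local/nagios \
        --with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios \
        --with-nagios-user=nagios --with-nagios-group=nagios \
        --with-command-group=nagcmd &&
    run make all &&
    run make install &&
    run make install-init &&
    run make install-commandmode &&
    run make install-config
}
```

Running DRY_RUN=1 build_nagios first only lists the commands, which is a cheap way to review them before anything is compiled or installed.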

  15. Installing the plugins

    Plugins are usually installed in the libexec/ directory of your Nagios installation (i.e. /usr/local/nagios/libexec). Plugins are scripts or binaries which perform all the service and host checks that constitute monitoring.

    # cd /tmp
    # tar -xvzf nagios-plugins-1.4.1.tar.gz
    # cd nagios-plugins-1.4.1
    # ./configure
    # make
    # make install


  16. Setup the Apache web interface

    To make Nagios accessible through the Apache web server we have to set up a config file for it. Create the config file as follows:

    # vi /etc/apache2/conf.d/nagios.conf

    Insert these elements into that new file:

    ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
    <Directory "/usr/local/nagios/sbin">
    AllowOverride AuthConfig
    Options ExecCGI
    Order allow,deny
    Allow from all
    </Directory>

    Alias /nagios /usr/local/nagios/share
    <Directory "/usr/local/nagios/share">
    Options None
    AllowOverride AuthConfig
    Order allow,deny
    Allow from all
    </Directory>


    After that restart the Apache web server with the following command:

    # rcapache2 restart

  17. Setup a minimum Nagios configuration

    # cd /usr/local/nagios/etc
    # cp cgi.cfg-sample cgi.cfg
    # cp nagios.cfg-sample nagios.cfg
    # cp minimal.cfg-sample minimal.cfg
    # cp resource.cfg-sample resource.cfg


    For this Nagios configuration to work properly we have to create two additional, empty config files. Do not copy the sample files for these two; that would produce duplicate command definitions.

    # touch checkcommands.cfg
    # touch misccommands.cfg
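The copy and touch commands of this step can also be written as one small function, which is handy when you rebuild a test server. This is a sketch; the function name setup_minimal_config is invented here, and the file names are exactly the ones listed above.

```shell
# Hypothetical helper: copy the four sample files named in this step and
# create the two empty command files. $1 defaults to the article's path.
setup_minimal_config() {
    dir="${1:-/usr/local/nagios/etc}"
    for name in cgi nagios minimal resource; do
        cp "$dir/$name.cfg-sample" "$dir/$name.cfg" || return 1
    done
    # Created empty on purpose: copying their sample versions would
    # lead to duplicate command definitions.
    touch "$dir/checkcommands.cfg" "$dir/misccommands.cfg"
}
```

The function returns a non-zero status as soon as one of the sample files is missing, so a typo in the directory name does not go unnoticed.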


  18. The last configuration steps ...

    Set the Nagios user and group as owner of the Nagios installation:

    # chown -R nagios.nagios /usr/local/nagios

    Deactivate the authentication for the cgi's:

    # vi cgi.cfg

    Search for the line "use_authentication=1" and change it to "use_authentication=0". That makes testing easier,
    but not all functions are available while authentication is disabled.
    For production use it should be activated again later, but then you have to configure some other parts of Nagios as well.
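If you prefer not to edit the file by hand, the same change can be made with sed. The snippet below is a sketch for test setups only; the function name disable_cgi_auth is made up for this example, and it assumes the line reads exactly use_authentication=1.

```shell
# Hypothetical helper: switch use_authentication from 1 to 0 in cgi.cfg.
# $1 defaults to the path used in this article.
disable_cgi_auth() {
    cfg="${1:-/usr/local/nagios/etc/cgi.cfg}"
    # Write through a temporary file, then copy the content back so the
    # original file keeps its owner and permissions.
    sed 's/^use_authentication=1$/use_authentication=0/' "$cfg" > "$cfg.tmp" &&
    cat "$cfg.tmp" > "$cfg" &&
    rm -f "$cfg.tmp"
}
```

All other lines of the file are left untouched, so the change is easy to revert before the server goes into production.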

  19. Verify the Nagios configuration and start it

    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

    The "Total Warnings" and "Total Errors" should be 0 if you have done everything correctly.
    If so, start Nagios for the first time:

    # /etc/init.d/nagios start

    Activate Nagios so that the runlevel scripts start it automatically:

    # insserv nagios

  20. Test Nagios access with a web browser

    http://<servername or ip address>/nagios

    You should see the following:






Now you can start exploring Nagios yourself.



Nice to know: All screens refresh every 30 seconds, no need to reload them.
That time can be changed even to lower values in the Nagios configuration files.

Just let me show you the most important views:



1. Tactical Overview:




This screen gives you an overview of the current status of the monitored services and hosts. Take a look at "Hosts" and you see that you are currently monitoring only one host. In the "Services" line you see that you are monitoring 5 services, and ideally all of them report the status "OK". As a summary, the "Network Health" host and service health bars are filled completely with green, indicating that all configured hosts and services are OK.

Most fields on this page are links to more specific views. If you want to know more about your 5 monitored services you can either click on the "5 OK" field or choose "Service Detail" on the left.



2. Service Detail



This screen shows you all the services configured to be monitored and their current status.
As we saw in the tactical screen, these are the five services we monitor right now.

You can see basic information about each service on this page:

  • Host: the host for which the service is configured. If this field is marked red, the host itself is down; if it is just grey, the server is up and reachable with ping.

  • Status: the current status of the service (OK = green, Warning = yellow, Critical = red, Unknown = orange)

  • Last Check: date and time of the most recent check

  • Duration: how long the service has been in this status

  • Attempt: how many attempts were needed for the check

  • Status Information: the output from the check program

Again if you want to know more about a single service, select it by its name and you are redirected to a more detailed page about it.



3. Host detail

This is the same kind of view as the service detail, showing the details of the monitored hosts, so I have not included a screen shot of it. You see all configured hosts and can again select one to get more information about it.



4. Service Problems


Hopefully this screen will stay empty as long as possible. This is the screen that I keep open all day long. At the top of this document is an actual screen shot from our system. There are some service problems. Hmm.. something to do for me ... Whenever a service reports a failure you will see it on this page. The browser refreshes this page every 30 seconds too, so you always get the current list of failed services.

When a service reports a failure, its line is shown here. The interval at which a service is checked can be set in the Nagios configuration. The minimum interval is 1 minute. When the next check reports that everything is okay for that service, it disappears from this list. So this is the page where you see the current, up-to-date failures of your monitored services.





Last but not least for this article I want to show you how you can add another service with a new check program. I think this is the best way to understand the configuration for the hosts and services.



To show you how you could add new check programs on your own, here is an easy sample:

We add a simple bash script that checks whether the file /tmp/nagios.chk exists. If it is there and executable the service goes to critical, if it is there but not executable it goes to warning, and if it does not exist the service is ok.



  1. Create the executable check file

    # vi /usr/local/nagios/libexec/check_file_exist.sh

    Add the following to that file:

    #!/bin/bash
    #
    # Check if a local file exists
    #
    while getopts F: VAR
    do
        case "$VAR" in
            F ) LOGFILE=$OPTARG ;;
            * ) echo "wrong syntax: use $0 -F <file to check>"
                exit 3 ;;
        esac
    done

    if test "$LOGFILE" = ""
    then
        echo "wrong syntax: use $0 -F <file to check>"
        # Nagios exit code 3 = status UNKNOWN = orange
        exit 3
    fi

    if test -e "$LOGFILE"
    then
        if test -x "$LOGFILE"
        then
            echo "Critical $LOGFILE is executable !"
            # Nagios exit code 2 = status CRITICAL = red
            exit 2
        else
            echo "Warning $LOGFILE exists !"
            # Nagios exit code 1 = status WARNING = yellow
            exit 1
        fi
    else
        echo "OK: $LOGFILE does not exist !"
        # Nagios exit code 0 = status OK = green
        exit 0
    fi


    Now set the file attributes:

    # chown nagios.nagios /usr/local/nagios/libexec/check_file_exist.sh
    # chmod +x /usr/local/nagios/libexec/check_file_exist.sh
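Before wiring the script into Nagios it can be worth running it by hand and looking at its exit code. The helper below is a sketch for that purpose; expect_exit is a name made up for this example and is not part of Nagios or its plugins.

```shell
# Hypothetical test helper: run a command (its output is discarded) and
# compare its exit code with the expected one given as first argument.
expect_exit() {
    want="$1"; shift
    got=0
    "$@" >/dev/null 2>&1 || got=$?
    if [ "$got" -eq "$want" ]; then
        echo "pass: exit $got"
    else
        echo "FAIL: expected exit $want, got $got"
        return 1
    fi
}
```

As long as /tmp/nagios.chk does not exist, expect_exit 0 /usr/local/nagios/libexec/check_file_exist.sh -F /tmp/nagios.chk should pass, and calling the script without the -F option should match expect_exit 3.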

  2. Add the check program to the nagios configuration

    Each new check command has to be defined once in the global Nagios configuration:

    # vi /usr/local/nagios/etc/minimal.cfg

    Add the following block at the end of the file:

    define command{
    command_name check_file_exist
    command_line $USER1$/check_file_exist.sh -F /tmp/nagios.chk
    }



  3. Add a new service to the localhost

    Each new service has to be defined once in the Nagios configuration and can be assigned to a single host, multiple hosts or even a host group. We assign it only to the localhost that is already defined in this base configuration:

    # vi /usr/local/nagios/etc/minimal.cfg

    Add the following block at the end of the file:

    define service{
    use generic-service
    host_name localhost
    service_description File check
    is_volatile 0
    check_period 24x7
    max_check_attempts 4
    normal_check_interval 5
    retry_check_interval 1
    contact_groups admins
    notification_options w,u,c,r
    notification_interval 960
    notification_period 24x7
    check_command check_file_exist
    }



  4. Verify Nagios configuration and restart it

    After every change to the config files you should verify the Nagios configuration and then restart Nagios:

    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

    The Total Warnings and Total Errors should be 0 if you have done everything correctly.
    If so, restart it with:

    # /etc/init.d/nagios restart
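Because these two commands always belong together, they can be combined so that a broken configuration never reaches the running daemon. The following is a sketch; the function name safe_restart and the three variables are inventions of this example, and it relies on nagios -v returning a non-zero exit code when verification fails, which you should confirm on your version.

```shell
# Paths used in this article; override the variables if yours differ.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# Hypothetical helper: restart Nagios only if the config check succeeds.
safe_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        "$NAGIOS_INIT" restart
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}
```

When the function reports a failed check, run the verify command by hand to see the full list of warnings and errors.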


  5. Check if the new program is working

    First take a look at the tactical screen: you should see that one service is in status pending.
    That means no check has been run for this service yet.
    Wait a few minutes and it should no longer be pending, and the number of OKs should increase from 5 to 6.

    Now create the file and watch the tactical screen, the service detail screen or the service problems screen.

    # touch /tmp/nagios.chk

    As we set the normal_check_interval to 5 minutes in the service definition, you should get the warning message within that time. Now add the executable attribute and watch:

    # chmod +x /tmp/nagios.chk

    The status should change to critical within the check interval.
    When you delete the file the service should return to status ok.

    # rm /tmp/nagios.chk

So that's all for the moment. I hope I have shown you a little bit about Nagios and how it works.

For me it's a great tool and it saves me a lot of time during the day-to-day business.

If you continue to work with it, you will find many things that can be improved in the configuration files. Please remember this is only a simple installation. If you like, I can write some more articles about it: how we manage our config files and which other check programs we added to the system. We even added user authentication for Nagios access against our eDirectory via LDAP, and so on.

Rainer Brunold



