Nagios: Host and Service Monitoring Tool
Novell Cool Solutions: Feature
By Rainer Brunold
Posted: 19 Jan 2006
Host and Service Monitoring Tool - quick overview and basic installation guide
What is Nagios ?
This is a sample screen from our installation / service problems view.
Basically, Nagios (http://www.nagios.org) is an open source host, service and network monitoring tool. It lets you manage different types of services and hosts running on different operating systems such as Linux, NetWare, Windows and AIX. It is flexible in configuration and can be extended as much as you want. It is configured in text files and managed with a web browser.
A basic installation gives you a set of Nagios check programs, so you can be monitoring your first hosts and services less than an hour after you start the installation.
Based on that default configuration you can start extending the configuration to your special needs. You can use the existing check programs or add more: take a look at http://www.nagiosexchange.org, where other developers offer many of their check programs for download. Or just write your own in any programming language that is available on Linux. For each service you can set up time frames for the monitoring process as well as notifications when alarms arise. You can, but you do not have to. It's always your choice, and that's what is great about Nagios. If you do not want to set up any notifications, ignore that part of the configuration; you can always watch the service problems in the web browser. That is what I currently do. It's good to come to the office in the morning, take one look at the Nagios service problems view and know what is going on. And when I do an eDirectory migration in our 260+ NetWare server tree, Nagios is open in the background to show which servers aren't reachable.
Here is a list of check programs in the base system:
check_breeze check_http check_nt check_ssh
check_by_ssh check_icmp check_ntp check_swap
check_dhcp check_ifoperstatus check_nwstat check_tcp
check_dig check_ifstatus check_oracle check_time
check_disk check_imap check_overcr check_udp
check_disk_smb check_ircd check_ping check_udp2
check_dns check_ldaps check_pop check_ups
check_dummy check_load check_procs check_users
check_file_age check_log check_radius check_wave
check_flexlm check_mailq check_real negate
check_fping check_mrtg check_rpc urlize
check_ftp check_mrtgtraf check_sensors utils.pm
check_hpjd check_nagios check_smtp utils.sh
At http://www.nagiosexchange.org you'll find a lot more if you are missing something here.
Here are some things we wanted to solve with Nagios:
We are running a lot of different management solutions (ZEN for servers, IBM Tivoli, HP / Compaq Systems Insight Manager, McAfee ePolicy Orchestrator, ...) and all of them are very strong and powerful in some areas. The problem is that some of them are really time consuming to install, configure and administer. It is difficult to add your own services, and hard to get a single view of all current services and their existing problems.
So I went looking for a solution that is quite easy to deploy but powerful enough to handle all our requirements. Of course I will try to replace some of the existing management solutions in our company with it, but I am sure it's not possible to drop them all for this one. A combination of them will be best for us.
Here are some examples (besides many standard monitoring requirements such as CPU and filesystem usage) of tasks I had to deal with in day-to-day administration and wanted to automate with Nagios:
check Bordermanager SurfControl logfiles to see if the update happened
check ftp/tls and ftp/ssh server functionality with a real ftp user connect
check time synchronization on oes/linux server
check running virus scanner processes (like LinuxShield from McAfee)
check HP Proliant server status from the HP Proliant support pack agents
check DNS round robin functionality
check Vmware remote connection and web interface ports
check ZEN linux management mirror process for ZLM and RCD targets
check memory / swap usage on linux servers
check linux drbd synchronization status
check server availability on one screen during eg. eDirectory migrations
check ldap functionality
check web applications
check oes/linux cluster resources
So how does it work ?
To put it simply, you define a host by its name, its IP address and a few other parameters, and assign some services to it. A service is nothing more than a configured check program that runs some tests at a predefined interval. For example, the ftp service uses the check_ftp program to make ftp connections to the server. The check program reports its result back to Nagios with an exit code: exit code 0 means the test was okay = green, exit code 1 means warning = yellow, exit code 2 means critical = red and exit code 3 means unknown = orange.
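The exit code convention described above can be sketched as a small shell function. The name nagios_state is hypothetical and not part of Nagios; it only restates the mapping from this paragraph.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): translate a check program's
# exit code into the state name and colour described in the text.
nagios_state() {
    case "$1" in
        0) echo "OK (green)" ;;
        1) echo "WARNING (yellow)" ;;
        2) echo "CRITICAL (red)" ;;
        3) echo "UNKNOWN (orange)" ;;
        *) echo "exit code $1 is outside the 0-3 convention" ;;
    esac
}

# Example: what a check program exiting with 2 means to Nagios.
nagios_state 2
```

Every check program, whatever language it is written in, only has to print one line of status text and exit with one of these four codes.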
Nagios itself shows you the currently existing problems in the service problems view (hopefully this list is short or even empty), and a list of all configured services and their status in the service details.
Depending on your notification configuration, it also notifies you with the relevant information.
If you are interested in such a system monitoring tool, it is well worth doing a test installation and spending a few hours with it. We are currently monitoring 308 hosts and about 1058 services.
Additional configurations / extensions:
There are a lot of configurations / extensions that are not covered in this document. As you discover more of the possible configurations, you will find things like how to put a web application link behind a host's extra notes. For example, if a host is down, you just select its extra host notes and you are forwarded directly to, e.g., the HP Remote Insight board. Or if a server shows some warnings from the HP Proliant agents, you can select the additional service tasks and go directly to the Proliant web management interface on the server. There is no need for us to enter the URL of that web page anymore, or even to look at the HP Systems Insight Manager for the server status. That's all in Nagios now.
I'll try to show you how you can set up a Nagios installation with a little Linux knowledge in an hour or two. I think that's the best way to take a look at it. After that I'll show how to add a new check program for Nagios that monitors whether a single file exists.
So here is a basic installation guide:
Needless to say, you should do this on a test server first. If you use, e.g., PuTTY to connect to your server over ssh, you can copy and paste the commands from this document directly into the shell, so you do not have to type them yourself.
Do a default OES/Linux or SLES 9 - 32 bit installation.
Nagios does not need very much memory (less than 32 MB) or disk space (less than 50 MB), so it could be a "small" server or even a virtual machine.
Install / remove some packages of your installation
OES and SLES contain Nagios version 1.2 in their distribution, but I decided to use the current version 2.0rc1 from http://www.nagios.org. So I had to remove and install a few packages with YaST:
remove if installed: nagios, nagios-nsca and nagios-plugins
install if not yet done: gd-devel and libpng-devel packages
and the whole Simple Webserver selection
Download the Nagios and the plugin tarball
Download the most recent Nagios tarball and the official plugin tarball for that Nagios version from http://www.nagios.org/download and copy them to /tmp on your test server.
Right now Nagios version 2.0rc1 is available. Normally I prefer to use only rpms for installation, but this time I use the tarball so I can configure some additional parts during compilation and installation. I tried this installation procedure with other 2.0x versions and it worked well.
current nagios tarball: nagios-2.0rc1.tar.gz
current plugin tarball: nagios-plugins-1.4.1.tar.gz
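Before extracting anything, it can save time to confirm that both downloads arrived intact. The following is a minimal sketch, assuming the two file names listed above and /tmp as the download location; the helper name tarball_ok is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file exists and can be listed
# as a gzip-compressed tar archive (tar -t lists without extracting).
tarball_ok() {
    [ -f "$1" ] && tar -tzf "$1" >/dev/null 2>&1
}

for f in /tmp/nagios-2.0rc1.tar.gz /tmp/nagios-plugins-1.4.1.tar.gz; do
    if tarball_ok "$f"; then
        echo "ok: $f"
    else
        echo "missing or damaged: $f"
    fi
done
```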
Create the Nagios user and group
You're probably going to want to run Nagios under a normal user account, so add a new user and group to your system with the following commands:
# useradd -m nagios
# groupadd nagios
Create the installation directory
Create the base directory where you would like to install Nagios as follows...
# mkdir /usr/local/nagios
Change the owner of the base installation directory to be the Nagios user and group you added earlier as follows:
# chown nagios.nagios /usr/local/nagios
Identify the web server user
You're probably going to want to issue external commands (like acknowledgements and scheduled downtime) from the web interface. To do so, you need to identify the user your web server runs as (typically wwwrun). This setting is found in your web server configuration file. The following command can be used to determine quickly what user Apache is running as:
# grep -R "^User" /etc/apache2/*
Normally the user is wwwrun. We will add it to a new command group in the next step; if your web server runs as a different user, be sure to use the right one there.
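If you would rather capture the user name in a variable than read it from the grep output, something like the following sketch may help. The helper name apache_user is hypothetical, and unlike the recursive grep above it only looks at the top-level .conf files, so treat an empty result as "look manually".

```shell
#!/bin/sh
# Hypothetical helper: print the account named by the first "User"
# directive found in the given Apache configuration files.
# Commented-out lines ("#User ...") are ignored because their first
# field is not exactly "User".
apache_user() {
    awk '$1 == "User" { print $2; exit }' "$@" 2>/dev/null || true
}

WEBUSER=$(apache_user /etc/apache2/*.conf)
echo "web server user: ${WEBUSER:-not found}"
```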
Add a command file group
Next we're going to create a new group whose members include the user your web server is running as and the user Nagios is running as. Let's say we call this new group 'nagcmd':
# groupadd nagcmd
Next, add the users that your web server and Nagios run as to the newly created group with the following commands:
# usermod -G nagcmd wwwrun
# usermod -G nagcmd nagios
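To confirm that both accounts really ended up in the new group, a quick check along these lines can be used. It is only a sketch; the helper name in_group is hypothetical, and the user names are the ones from this guide.

```shell
#!/bin/sh
# Hypothetical helper: succeed if user $1 is a member of group $2.
# "id -nG" prints all group names of the user on one line.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

for u in wwwrun nagios; do
    if in_group "$u" nagcmd; then
        echo "$u is in nagcmd"
    else
        echo "$u is NOT in nagcmd"
    fi
done
```

One thing to be aware of: on systems with current shadow-utils, usermod -G without the -a option replaces the user's existing supplementary groups instead of adding to them, so it is worth checking the memberships before and after.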
Extract the Nagios tarball
# cd /tmp
# tar -xvzf nagios-2.0rc1.tar.gz
# cd nagios-2.0rc1
Compile the Nagios package
Run the configure script to initialize variables and create a Makefile as follows:
# ./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios --with-nagios-user=nagios --with-nagios-group=nagios --with-command-group=nagcmd
Compile the binaries
Compile Nagios and the CGIs with the following command:
# make all
Installing the binaries and HTML files
Install the binaries and HTML files (documentation and main web page) with the following command:
# make install
Installing an init script
If you want, you can also install the init script /etc/init.d/nagios with the following command:
# make install-init
Installing command mode
If you want, you can also install the command mode environment with the following command:
# make install-commandmode
Installing sample config files
Now we install some default configuration files, which have to be changed a little later on:
# make install-config
Installing the plugins
Plugins are usually installed in the libexec/ directory of your Nagios installation (i.e. /usr/local/nagios/libexec). Plugins are scripts or binaries which perform all the service and host checks that constitute monitoring.
# cd /tmp
# tar -xvzf nagios-plugins-1.4.1.tar.gz
# cd nagios-plugins-1.4.1
# ./configure
# make
# make install
Setup the Apache web interface
To make Nagios accessible through the apache web server we have to setup a config file for it. Create the config file as follows:
# vi /etc/apache2/conf.d/nagios.conf
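Insert these elements into that new file:
ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
<Directory "/usr/local/nagios/sbin">
Allow from all
</Directory>
Alias /nagios /usr/local/nagios/share
<Directory "/usr/local/nagios/share">
Allow from all
</Directory>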
After that restart the apache web server with the following command:
# rcapache2 restart
Setup a minimum Nagios configuration
# cd /usr/local/nagios/etc
# cp cgi.cfg-sample cgi.cfg
# cp nagios.cfg-sample nagios.cfg
# cp minimal.cfg-sample minimal.cfg
# cp resource.cfg-sample resource.cfg
For proper use of this Nagios configuration we have to create two additional, empty config files. Do not copy the sample files for these two, otherwise there would be duplicate command definitions.
# touch checkcommands.cfg
# touch misccommands.cfg
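If you repeat the installation a few times on test servers, the four cp commands above can be wrapped in a small loop that does not overwrite files you have already edited. This is only a sketch; the function name install_samples is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: in directory $1, copy each named sample file
# (<name>.cfg-sample) to its live name (<name>.cfg) unless the live
# file already exists.
install_samples() {
    dir=$1
    shift
    for name in "$@"; do
        if [ -e "$dir/$name.cfg" ]; then
            echo "keeping existing $name.cfg"
        else
            cp "$dir/$name.cfg-sample" "$dir/$name.cfg" && echo "created $name.cfg"
        fi
    done
}

# Usage, matching the four files copied above:
#   install_samples /usr/local/nagios/etc cgi nagios minimal resource
```

The list of names is passed explicitly on purpose, so that checkcommands and misccommands are never copied from their samples by accident.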
The last configuration steps ...
Set the Nagios user and group as owner of the Nagios installation:
# chown -R nagios.nagios /usr/local/nagios
Deactivate authentication for the CGIs:
# vi cgi.cfg
Search for the line "use_authentication=1" and change it to "use_authentication=0". That makes testing easier to handle,
but not all functions are available while authentication is disabled.
For production use it should be activated again later, but then you have to configure some other parts of Nagios as well.
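The same change can be made without an editor. The following sketch assumes the setting sits at the start of a line exactly as quoted above; the function name set_cgi_auth is hypothetical. It keeps a backup so the original file can be restored.

```shell
#!/bin/sh
# Hypothetical helper: set use_authentication to $2 (0 or 1) in file $1.
# The previous version of the file is kept next to it as <file>.bak.
set_cgi_auth() {
    cp "$1" "$1.bak" &&
    sed "s/^use_authentication=.*/use_authentication=$2/" "$1.bak" > "$1"
}

# Usage for the test setup described above:
#   set_cgi_auth /usr/local/nagios/etc/cgi.cfg 0
# and later, for production:
#   set_cgi_auth /usr/local/nagios/etc/cgi.cfg 1
```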
Verify the Nagios configuration and start it
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
The "Total Warnings" and "Total Errors" should be 0 if you have done everything correctly.
If so, start Nagios for the first time:
# /etc/init.d/nagios start
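If you want a script to decide whether the verification was clean, the two totals can be picked out of the output. This sketch assumes the summary lines start with "Total Warnings:" and "Total Errors:" followed by a number, which is how I have seen them printed; adjust the patterns if your version prints them differently. The function name totals_clean is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the output of "nagios -v" on stdin and
# succeed only if both summary lines are present and report 0.
# Assumption: the lines look like "Total Warnings: 0" / "Total Errors:   0".
totals_clean() {
    awk -F: '
        /^Total Warnings/ { w = $2 + 0; seen_w = 1 }
        /^Total Errors/   { e = $2 + 0; seen_e = 1 }
        END { exit !(seen_w && seen_e && w == 0 && e == 0) }
    '
}

# Usage:
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | totals_clean \
#       && echo "configuration is clean"
```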
Activate Nagios to start within the runlevel scripts automatically
# insserv nagios
Test Nagios access with a web browser
http://<servername or ip address>/nagios
You should see the following:
Now you can start exploring Nagios yourself.
Good to know: all screens refresh every 30 seconds, so there is no need to reload them.
That time can be changed, even to lower values, in the Nagios configuration files.
Just let me show you the most important views:
1. Tactical Overview:
This screen gives you an overview of the current status of the monitored services and hosts. Take a look at "Hosts" and you see that you are currently monitoring only one host. In the "Services" line you see that you are monitoring 5 services, and probably all of them are reporting the status "OK". As a summary, the "Network Health" host and service health bar is filled completely with green, indicating that all configured hosts and services are OK.
Most fields on this page are links to more specific views. If you want to know more about your 5 monitored services, you can either click on the "5 OK" field or choose "Service Detail" on the left.
2. Service Detail
This screen shows you all the services configured to be monitored and their current status.
As we saw in the tactical screen, here are the five services we monitor right now.
You can see basic information about each service on this page:
Host: the host on which the service is configured
If this field is marked red, the host itself is down;
if it is just grey, the host is up and reachable with ping.
Status: the current status of the service
OK = green
Warning = yellow
Critical = red
Unknown = orange
Last Check: date and time of the most recent check
Duration: how long the service has been in this status
Attempt: how many attempts were needed for the check
Status Information: the output of the check program
Again, if you want to know more about a single service, select it by its name and you are redirected to a more detailed page about it.
3. Host Detail
This is the same kind of view as the service detail, but showing the details of the monitored hosts, so I have no screen shot of it. You would see all configured hosts and again have the choice to select one to get more information about it.
4. Service Problems
Hopefully this screen will stay empty as long as possible. This is the screen that I keep open all day. At the top of this document is a current screen shot from our system. There are some service problems. Hmm.. something for me to do ... Whenever a service reports a failure, you will see it on this page. The browser also refreshes this page every 30 seconds, so you always get the current list of failed services.
When a service reports a failure, a line for it is shown here. The interval at which a service is checked can be set in the Nagios configuration; the minimum interval is 1 minute. When the next check reports that everything is okay for that service, it disappears from this list. So this is the page where you can see the current failures of your monitored services.
Last but not least for this article, I want to show you how you can add another service with a new check program. I think this is the best way to understand the configuration of hosts and services.
We add a simple bash script that checks whether the file /tmp/nagios.chk exists. If it is there and executable, the service goes to critical; if it is there but not executable, it goes to warning; and if it doesn't exist, the service is OK.
Create the executable check file
# vi /usr/local/nagios/libexec/check_file_exist.sh
Add the following to that file:
#!/bin/bash
# Check if a local file exists
while getopts F: VAR
do
  case "$VAR" in
    F ) LOGFILE=$OPTARG ;;
    * ) echo "wrong syntax: use $0 -F <file to check>"
        exit 3 ;;
  esac
done
if test "$LOGFILE" = ""
then
  echo "wrong syntax: use $0 -F <file to check>"
  # Nagios exit code 3 = status UNKNOWN = orange
  exit 3
fi
if test -e "$LOGFILE"
then
  if test -x "$LOGFILE"
  then
    echo "Critical $LOGFILE is executable !"
    # Nagios exit code 2 = status CRITICAL = red
    exit 2
  else
    echo "Warning $LOGFILE exists !"
    # Nagios exit code 1 = status WARNING = yellow
    exit 1
  fi
else
  echo "OK: $LOGFILE does not exist !"
  # Nagios exit code 0 = status OK = green
  exit 0
fi
Now set the file attributes:
# chown nagios.nagios /usr/local/nagios/libexec/check_file_exist.sh
# chmod +x /usr/local/nagios/libexec/check_file_exist.sh
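Before wiring the script into Nagios, it is worth running it by hand and looking at both the output line and the exit code, because those two things are all Nagios gets to see. A minimal sketch of such a manual test follows; the function name try_check is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a check command and show the status text it
# prints together with the exit code Nagios would receive.
try_check() {
    code=0
    output=$("$@") || code=$?
    echo "output: $output"
    echo "exit:   $code"
}

# Usage with the script from this article:
#   try_check /usr/local/nagios/libexec/check_file_exist.sh -F /tmp/nagios.chk
```

Running it once with the file missing, once after touch and once after chmod +x should walk through exit codes 0, 1 and 2.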
Add the check program to the nagios configuration
Each new check command has to be defined once in the global Nagios configuration:
# vi /usr/local/nagios/etc/minimal.cfg
Add the following block at the end of the file:
command_line $USER1$/check_file_exist.sh -F /tmp/nagios.chk
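Only the command_line directive is shown above. A Nagios command object also needs a surrounding define command{ } block and a command_name. The sketch below writes such a block to a scratch file so you can review it before pasting it into minimal.cfg. Note that the name check_file_exist is my assumption, as is the CMD_OUT variable; $USER1$ is the macro from resource.cfg that points to the libexec directory.

```shell
#!/bin/sh
# Write a complete command object to a scratch file for review.
# Assumption: "check_file_exist" as command_name is illustrative only;
# any unique name works as long as the service refers to the same one.
CMD_OUT=${CMD_OUT:-$(mktemp)}
cat > "$CMD_OUT" <<'EOF'
define command{
        command_name    check_file_exist
        command_line    $USER1$/check_file_exist.sh -F /tmp/nagios.chk
        }
EOF
echo "wrote command definition to $CMD_OUT"
```

The quoted 'EOF' keeps the shell from expanding $USER1$, so the macro reaches the file unchanged.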
Add a new service to the localhost
Each new service has to be defined once in the Nagios configuration and can be assigned to a single host, multiple hosts or even a host group. We assign it only to the localhost that is already defined in this base configuration:
# vi /usr/local/nagios/etc/minimal.cfg
Add the following block at the end of the file:
service_description File check
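Only the service_description is shown above, and a service object needs more than that. Here is a sketch of what a complete definition could look like, again written to a scratch file for review. The service description, the host localhost and the 5 minute normal_check_interval come from this article. Everything else is an assumption modelled on the sample minimal.cfg: the generic-service template, the 24x7 time period, the admins contact group, the remaining values and the check_file_exist command name. Compare it with the existing services in your own minimal.cfg before using it.

```shell
#!/bin/sh
# Write a complete service object to a scratch file for review.
# Assumptions: template, time period, contact group and command name
# below must match objects that exist in your configuration.
SVC_OUT=${SVC_OUT:-$(mktemp)}
cat > "$SVC_OUT" <<'EOF'
define service{
        use                     generic-service
        host_name               localhost
        service_description     File check
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_file_exist
        }
EOF
echo "wrote service definition to $SVC_OUT"
```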
Verify Nagios configuration and restart it
After every change to the config files you should verify the Nagios configuration, and then you have to restart Nagios:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
The Total Warnings and Total Errors should be 0 if you have done everything correct.
So restart it with:
# /etc/init.d/nagios restart
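Since you will repeat this verify-then-restart pair after every change, it can be combined so that a broken configuration never reaches the running daemon. The sketch below relies on nagios -v returning a non-zero exit code when verification fails; warnings alone do not stop the restart. The function name safe_restart and the three variables are hypothetical, with defaults taken from the paths in this guide.

```shell
#!/bin/sh
# Paths from this guide; override them through the environment if
# your installation differs.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# Hypothetical helper: restart only when the configuration check exits 0.
safe_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
        "$NAGIOS_INIT" restart
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}

# Usage: safe_restart
```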
Check if the new program is working
First take a look at the tactical screen; you should see that one service is in status pending.
That means no check has been done yet for this service.
Wait a few minutes and the pending entry should disappear, while the number of OKs should increase from 5 to 6.
Now create the file and watch the tactical screen, the service detail screen or the service problems screen.
# touch /tmp/nagios.chk
As we set the normal_check_interval to 5 minutes in the service definition, you should get the warning message during that time. Now add the executable attribute and watch:
# chmod +x /tmp/nagios.chk
The status should change during the check interval to critical.
When you delete the file the service should return to status ok.
# rm /tmp/nagios.chk
So that's all for the moment. I hope I have shown you a little bit about Nagios and how it works.
For me it's a great tool and it saves me a lot of time during the day-to-day business.
If you continue to work with it, there are a lot of things that could be made better with the configuration files. Please remember this is only a simple installation of it. If you would like I can write some more articles about it and how we manage our config files and what other check programs we added to the system. We even added user authentication for Nagios access with ldap to our eDirectory and so on.