Nagios 3.0 Extension - NagiosGraph

Novell Cool Solutions: Feature
By Rainer Brunold

Posted: 26 Oct 2007
 

Products:

  • Open Enterprise Server
  • SUSE Linux Enterprise Server
  • SUSE Linux Enterprise Desktop
  • ZENworks

Chapter 2 - Integrating NagiosGraph

Continuing from the previous Cool Solutions article, which led you through a basic Nagios installation, it is time to add some extensions. In my job as a system architect and administrator, I often face the problem that I do not know the history or the trend of a specific service over the course of the day or night.

So I looked for a graphing solution that integrates into Nagios and found NagiosGraph to be a very easy way to do that. There are many more available at http://www.nagiosexchange.org in the Categories / AddOn / Charts section, so you might have a look there. I will concentrate on this one.

This is a sample screenshot from the NTP service, which shows how much the time on the local server drifts. There is normally a fourth diagram on this page showing the yearly graph; I cut it off so that the image is not too long.

How does this work?

As part of the NagiosGraph installation and configuration, modifications to the Nagios main configuration files are necessary. We enable Nagios to write all service check results to an external file called perfdata.log and add a performance processing command that is started every 30 seconds. That command is part of the NagiosGraph package; it goes through the perfdata.log file and compares the entries against a reference map file. The map file contains regular expressions, and when the first expression matches, NagiosGraph picks the appropriate values from that check result and writes them to a round robin database. It uses the programs from the rrdtool package for that. If the database file doesn't exist, NagiosGraph creates it on the fly. Depending on the number of columns in that file, the size will vary between 30 and maybe 100 kB per service graph.

So the check results are now stored in a database file. NagiosGraph provides a show.cgi script that builds the charts from that data and can be linked from the Nagios service definitions. The show.cgi script accepts some parameters that allow you to customize the chart. We will do some customization later in this article.

When you enable performance data processing for several hosts and services, NagiosGraph will create a separate database file for each service on each host.

Here is the guide on how to install and configure NagiosGraph:

  1. Server Preparation

    NagiosGraph needs the perl and rrdtool packages to be installed on the server.

    As this is part of a default server installation no steps are required.

  2. Software Download and Extraction

    There is only one single package required for the NagiosGraph installation:

    Software          Download Link                                  Current File Name (as of 05/10/2007)
    Nagiosgraph 0.9   http://sourceforge.net/projects/nagiosgraph/   nagiosgraph-0.9.0.tgz

    Download that package and copy it to a temporary installation directory. I use /images for those steps.

    # mkdir /images
    # cp <nagiosgraph-0.9.0.tgz> /images
    # cd /images
    # tar -xvzf nagiosgraph-0.9.0.tgz
  3. Installation of NagiosGraph

    NagiosGraph needs no compilation because it consists of dynamically executed perl and cgi scripts. So we just have to copy them to the right locations.

    As for Nagios I follow the LSB rules and select /opt/nagiosgraph as the location of that program.

    # cd /images/nagiosgraph-0.9.0
    # mkdir /opt/nagiosgraph
    # cp nagiosgraph.conf map show.cgi insert.pl /opt/nagiosgraph
    # cp nagiosgraph.css /opt/nagios/share/stylesheets

    Now prepare some directories for log and database files. The log files for NagiosGraph are stored in /var/opt/nagiosgraph, the round robin database files are stored in the rrd subdirectory below.

    # mkdir -p /var/opt/nagiosgraph/rrd
    # touch /var/opt/nagiosgraph/nagiosgraph.log
    # chown -R nagios:nagcmd /opt/nagiosgraph
    # chown -R nagios:nagcmd /var/opt/nagiosgraph
    # chmod 2775 /var/opt/nagiosgraph
    # chmod 664 /var/opt/nagiosgraph/nagiosgraph.log
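If the two chmod modes above are unfamiliar, the following Python sketch decodes them into the permission strings that ls would show. Python is used here only for illustration and is not part of NagiosGraph; the helper name describe_mode is made up for this example.

```python
import stat


def describe_mode(octal_mode: int, is_dir: bool) -> str:
    """Return the ls-style permission string for a chmod-style octal mode."""
    # stat.filemode() needs the file type bits in addition to the permission bits.
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | octal_mode)


# Mode used for the directory /var/opt/nagiosgraph
print(describe_mode(0o2775, is_dir=True))
# Mode used for the file nagiosgraph.log
print(describe_mode(0o664, is_dir=False))
```

The leading 2 in 2775 is the setgid bit. On a directory it makes newly created files inherit the directory's group, which is convenient when both Nagios and the web server need access to the files.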
  4. Configuration Changes for Nagios

    As described before, we have to configure Nagios to write all service check results to an external log file that is parsed by the performance processing command. Here we enable the Nagios parameter "process_performance_data" globally for all service checks defined in Nagios. If you have a set of services for which you do not need charts, you can add the same parameter to the service definition and disable it explicitly there. If a service result is written to perfdata.log and no regular expression from the map file matches, that entry is dropped, so nothing happens with it. As long as you have a small number of services, there is no need to disable processing for a few services.

    First, please make the following modifications to the Nagios main configuration file. Most of these settings are commented out; please activate them and set the listed values.

    # vi /opt/nagios/etc/nagios.cfg
    ...
    process_performance_data=1
    ...
    service_perfdata_file=/var/opt/nagios/perfdata.log
    ...
    service_perfdata_file_template=$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$
    ...
    service_perfdata_file_mode=a
    ...
    service_perfdata_file_processing_interval=30
    ...
    service_perfdata_file_processing_command=process-service-perfdata
    ...

    As for all external actions Nagios can perform, we have to define the "service_perfdata_file_processing_command" in the commands.cfg configuration file. The value of that parameter (process-service-perfdata) is just a symbolic name for the script that will be executed; the exact definition is in commands.cfg.

    In the default configuration there is already a definition for the command_name process-service-perfdata.
    Please change the command_line of it to point to the NagiosGraph insert.pl script.

    # vi /opt/nagios/etc/objects/commands.cfg
    ...
    define command {
      command_name   process-service-perfdata
      command_line   /opt/nagiosgraph/insert.pl
    }
    ...

    Now we have to activate those changes in Nagios.

    As a general rule, when you modify the Nagios main configuration file you have to restart Nagios; when you modify only the object files, a reload is enough. In this case we have to restart Nagios because we made changes to nagios.cfg. That restart will also activate the changes made to the command configuration file.

    # /etc/init.d/nagios restart

    Immediately after the restart, Nagios will write new service check results to the perfdata.log file. As we have not yet completed the NagiosGraph configuration itself, NagiosGraph will not parse that file, so it will grow a little (not very much) until it gets parsed.

  5. Configuration of NagiosGraph

    First configure the directory structure in the NagiosGraph main configuration file.

    # vi /opt/nagiosgraph/nagiosgraph.conf
    ...
    logfile = /var/opt/nagiosgraph/nagiosgraph.log
    ...
    rrddir =  /var/opt/nagiosgraph/rrd
    ...
    mapfile = /opt/nagiosgraph/map
    ...
    perflog = /var/opt/nagios/perfdata.log
    ...

    Next the performance processing command insert.pl has to point to the NagiosGraph configuration file:

    # vi /opt/nagiosgraph/insert.pl
    ...
    my $configfile = '/opt/nagiosgraph/nagiosgraph.conf';
    ...
    

    And last, the show.cgi script that provides the charts also needs to point to the NagiosGraph configuration file:

    # vi /opt/nagiosgraph/show.cgi
    ...
    my $configfile = '/opt/nagiosgraph/nagiosgraph.conf';
    ...
  6. Configuration Changes for Apache

    We have to allow Apache to access show.cgi so that it can serve the chart page. Therefore we add a new file to the Apache configuration directory /etc/apache2/conf.d.

    # vi /etc/apache2/conf.d/nagiosgraph.conf
    
    ScriptAlias /nagiosgraph/ /opt/nagiosgraph/
    
    <Directory "/opt/nagiosgraph">
    #  SSLRequireSSL
       Options None
       AllowOverride None
       Order allow,deny
       Allow from all
    #  Order deny,allow
    #  Deny from all
    #  Allow from 127.0.0.1
       AuthName "Nagios Access"
       AuthType Basic
       AuthUserFile /opt/nagios/etc/htpasswd.users
       Require valid-user
    </Directory>

    To activate that Apache configuration we have to restart Apache.

    # rcapache2 restart
  7. Check Performance Data Processing

    As Nagios is still running in the background and processing service checks, check results are written to perfdata.log in /var/opt/nagios. Now that we have finished the basic NagiosGraph configuration, that data should be parsed and, because of some default map entries in NagiosGraph, processed.

    So when you now watch perfdata.log, you should see that it sometimes grows a little; that is when Nagios has processed a service check and written the results to it. Within 30 seconds after that the file should be empty again, because the Nagios service_perfdata_file_processing_interval has been reached and Nagios has started the processing command. That command parses all data in perfdata.log and, for the matching lines, writes the data to the round robin databases in /var/opt/nagiosgraph/rrd/localhost. Take a look there and you should see that some files exist. They only exist because NagiosGraph has a default map file with regular expressions that already match the default service check results we get from the Nagios default configuration.

    Please check that you have the file Current%20Load___load.rrd in your /var/opt/nagiosgraph/rrd/localhost directory, because we create the chart for it in the next section. If not, wait a few minutes until the next Nagios check occurs and the service data gets processed; then the file should exist.
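The file name above suggests a naming scheme of one directory per host and one file per service and database, with the service name URL-encoded. The following Python sketch rebuilds the expected path under that assumption. The scheme is inferred only from the single file name mentioned in this article, so treat it as a guess, especially for characters other than spaces; the function name rrd_path is made up.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

RRD_DIR = PurePosixPath("/var/opt/nagiosgraph/rrd")  # rrddir from nagiosgraph.conf


def rrd_path(host: str, service: str, database: str) -> PurePosixPath:
    """Guess the RRD file path for a host, service description and database name."""
    # Assumption: the service name is percent-encoded and joined to the
    # database name with three underscores, as in Current%20Load___load.rrd.
    file_name = f"{quote(service, safe='')}___{database}.rrd"
    return RRD_DIR / quote(host, safe="") / file_name


print(rrd_path("localhost", "Current Load", "load"))
```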

  8. Add Charts to the Nagios Web Frontend

    The next step is to add a chart to the service entry in Nagios. We would like to go from this view

    to this one:

    NOTE: The load on this machine seems to be very low.
    I will have to generate some load so we see something in the charts after that!

    First let's check the service definition for the "Current Load" service.
    We find that definition in the /opt/nagios/etc/objects/localhost.cfg.

    # less /opt/nagios/etc/objects/localhost.cfg
    ...
    define service{
           use                   local-service      ; Name of service template to use
           host_name             localhost
           service_description   Current Load
           check_command         check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
           }
    ...

    Here is a short description of the parameters of this definition:

    use: A complete service definition requires about 12 parameters. That makes each service definition longer than it should be, and most parameters are the same across different service definitions. Therefore Nagios can use templates to gather them in a single definition. Adding the parameter "register 0" to a definition marks it as a template that can be used by the real service definitions. In this case all parameters from the template "local-service" are inherited. You can repeat parameters that are defined in the template here to override the template values if you want. Use templates whenever possible!
    host_name: This defines for which host this service should be checked. It can be a single host or a comma-separated list of hosts. If you have a larger environment, you should collect hosts of the same type in a host group and assign the service to the host group instead of to each single host. Use host groups whenever possible!
    service_description: This is the name of the service that you also see in the Nagios web front end.
    If you have a larger environment with different types of operating systems, it makes sense to standardize the service description, like
    L:OS:LOAD for Linux / Operating System / Load or
    W:OS:CPU for Windows / Operating System / CPU or
    L:FS:ROOT for Linux / Filesystem / Root Filesystem.
    This will help you when you draw graphs for all those different types of hosts.
    check_command: This is the check command that is executed when Nagios checks this service.
    The definition of this command is in commands.cfg.
    There you will find its command_line, which contains the real program that is executed. Parameters that follow the check_command name here are referenced in commands.cfg as $ARG1$ and $ARG2$. The "!" is the parameter separator.
    The $USER1$ at the start of the command_line in commands.cfg is defined in resource.cfg and points to /opt/nagios/libexec.
    So in this case the command actually executed would be:
    /opt/nagios/libexec/check_load -w 5.0,4.0,3.0 -c 10.0,6.0,4.0
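The macro substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nagios implements it; the command_line template below is an assumption reconstructed from the resulting command shown above, and the function name expand_command is made up.

```python
def expand_command(command_line: str, check_command: str, user1: str) -> str:
    """Expand $USER1$ and $ARGn$ macros the way the article describes."""
    # "!" separates the command name from its arguments.
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$USER1$", user1)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return expanded


# Assumed command_line for check_local_load in commands.cfg.
COMMAND_LINE = "$USER1$/check_load -w $ARG1$ -c $ARG2$"

print(expand_command(
    COMMAND_LINE,
    "check_local_load!5.0,4.0,3.0!10.0,6.0,4.0",
    "/opt/nagios/libexec",
))
```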

    Something that is not covered in the default Nagios configuration files is the serviceextinfo definition.

    Such definitions allow you to show icons next to the service name and to define two links to other web applications. Those two links are the notes_url and the action_url. You can, for example, use the notes_url to point to NagiosGraph trend charts and the action_url to point to a web page behind that service. As an example for the action_url: we check the availability of the IBM TSM backup software web page with a service, and the action_url points exactly to that page. So I do not have to enter the URL when I would like to go there; I just have to click on that small icon.

    So now we add such a serviceextinfo definition for our "Current Load" service.

    I would place that definition right after the service definition itself to keep things together.

    Take care that the braces are closed after each definition!

    # less /opt/nagios/etc/objects/localhost.cfg
    ...
    define service{
           use                   local-service      ; Name of service template to use
           host_name             localhost
           service_description   Current Load
           check_command         check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
           }
    
    define serviceextinfo{
           host_name             localhost
           service_description   Current Load
           notes_url             /nagiosgraph/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&db=load,avg1min,avg5min,avg15min&geom=500x100&rrdopts=%2Dl%200%20%2Du%2010%20%2Dt%20CPU%2DLoad
           }
    ...

    The host_name and service_description have to match the service definition.

    Here is the description for the notes_url string:

    /nagiosgraph/show.cgi: This is the URL the link will point to, in this case on the local server where we placed the show.cgi script. You can verify that in the Apache configuration we did before.
    host=$HOSTNAME$&service=$SERVICEDESC$: These are the first two parameters that are passed to show.cgi. The script has to know which round robin database it has to use; remember that there might be a lot of them. $HOSTNAME$ and $SERVICEDESC$ are Nagios internal macros that Nagios replaces with the appropriate values when the link is built for the web page. So this would look like host=localhost&service=Current%20Load (%20 is an encoded space).
    db=load,avg1min,avg5min,avg15min: The next parameter tells show.cgi which database it has to use (load) and which columns (avg1min, avg5min and avg15min) it has to display. The database and column definitions are made in the NagiosGraph map file, which we will check a little later.
    geom=500x100: This parameter defines the size of each diagram. show.cgi will automatically draw a daily, weekly, monthly and yearly diagram in this size.
    rrdopts=%2Dl%200%20%2Du%2010%20%2Dt%20CPU%2DLoad: This one sets some more diagram-specific options for show.cgi. Spaces are encoded as %20, dashes as %2D. So decoding the string results in:
    rrdopts=-l 0 -u 10 -t CPU-Load
    -l ... lower limit of the diagram on the y scale
    -u ... upper limit of the diagram on the y scale
    -t ... title of the diagram
    So we will get a diagram where the y scale goes from 0 to 10 and the title above the diagram is CPU-Load. If you do not add an upper limit, it will be calculated automatically. For CPU charts it is interesting to set the lower limit to 0 and the upper limit to 100.
    As show.cgi uses the rrdgraph program in the background to draw the charts, you should be able to use most of the rrdgraph options here. Here is a link to those options:
    http://oss.oetiker.ch/rrdtool/doc/rrdgraph.en.html
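Encoding the rrdopts value by hand is error-prone, so a small helper can do it. The following Python sketch uses the standard urllib.parse module; the function names are made up for this example. Note that quote() leaves dashes untouched by default, so they are replaced separately to match the %2D encoding used in this article.

```python
from urllib.parse import quote, unquote


def encode_rrdopts(options: str) -> str:
    """Encode an rrdgraph option string for use in the notes_url."""
    # quote() encodes spaces as %20; dashes are converted to %2D afterwards.
    return quote(options, safe="").replace("-", "%2D")


def decode_rrdopts(encoded: str) -> str:
    """Turn the encoded rrdopts value back into readable options."""
    return unquote(encoded)


print(encode_rrdopts("-l 0 -u 10 -t CPU-Load"))
print(decode_rrdopts("%2Dl%200%20%2Du%2010%20%2Dt%20CPU%2DLoad"))
```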

    After you have added that definition to the localhost config file you have to reload Nagios to activate the changes:

    # /etc/init.d/nagios reload

    The next time you go to the service details page you should see that link beside the "Current Load" service.

    Clicking on it should open a new browser window that shows you the diagrams.

    Because we have just activated performance data processing, there might not be much data visible in the chart right now. In that case you have to wait some time and come back to check it later.

    It is also possible that there are no diagrams at all at the beginning. This can happen when you did the configuration steps so fast that the service has not been checked yet and therefore no round robin database has been created. So don't worry, just wait a few minutes and try again. In the default Nagios configuration files that service is checked every 5 minutes.

  9. NagiosGraph Map File

    I think we have now done most of the configuration; the only thing left to cover is the map file, which contains the regular expressions that have to match for the round robin database to be created. That file is located in /opt/nagiosgraph. Let's search for the load definition that creates the database for this specific service.

    # less /opt/nagiosgraph/map
    ...
    # Service type: unix-load
    #   output: OK - load average: 0.66, 0.70, 0.73
    #   perfdata:load1=0;15;30;0 load5=0;10;25;0 load15=0;5;20;0
    /output:.*load average: ([.0-9]+), ([.0-9]+), ([.0-9]+)/
    and push @s, [ load,
                   [ avg1min,  GAUGE, $1 ],
                   [ avg5min,  GAUGE, $2 ],
                   [ avg15min, GAUGE, $3 ] ];
    ...

    So let me explain how this works. In the last section, when adding the charts to the Nagios web front end, we found that the real check command line behind "Current Load" has the following syntax:

    /opt/nagios/libexec/check_load -w 5.0,4.0,3.0 -c 10.0,6.0,4.0

    Go to the ssh session of your server and run that command. You should get output similar to this:

    # /opt/nagios/libexec/check_load -w 5.0,4.0,3.0 -c 10.0,6.0,4.0
    OK - load average: 0.01, 0.02, 0.03|load1=0.010;5.000;10.000;0; load5=0.020;4.000;6.000;0; load15=0.030;3.000;4.000;0;

    The output consists of two parts. The first part is the "Status Information", referenced in the map file as output; everything after the "|" is the "Performance Data", referenced as perfdata in the map file. You also find this information in the service details of the Nagios web front end.
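As an illustration of this two-part format, the following Python sketch splits a plugin output line at the "|" and reads the first value of each performance data item. It is a simplified sketch that assumes labels without spaces and values without unit suffixes, as in the check_load output above; the function name split_plugin_output is made up.

```python
def split_plugin_output(raw: str):
    """Split plugin output into status information and performance values."""
    status, _, perfdata = raw.partition("|")
    values = {}
    for item in perfdata.split():
        # Each item looks like label=value;warn;crit;min;max
        label, _, rest = item.partition("=")
        values[label] = float(rest.split(";")[0])
    return status.strip(), values


raw_output = (
    "OK - load average: 0.01, 0.02, 0.03|"
    "load1=0.010;5.000;10.000;0; load5=0.020;4.000;6.000;0; load15=0.030;3.000;4.000;0;"
)
status, values = split_plugin_output(raw_output)
print(status)
print(values)
```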

    Now imagine Nagios has processed this command. As we have performance data processing activated, the whole string is written to perfdata.log along with some additional parameters. After the processing interval has been reached, NagiosGraph picks each line from perfdata.log and goes through the map file, searching for a regular expression that matches. Attention: the first one that matches will be executed and no further one is processed.

    The regular expression can look into either the "Status Information" or the "Performance Data".

    The following one refers to output, which means the "Status Information".

    /output:.*load average: ([.0-9]+), ([.0-9]+), ([.0-9]+)/
    OK - load average: 0.01, 0.02, 0.03|load1=0.010;5.000;10.000;0; load5=0.020;4.000;6.000;0; load15=0.030;3.000;4.000;0;
    ".*" matches "OK - "
    "load average: " matches exactly
    "([.0-9]+)" matches any number consisting of the digits 0-9 and ".", in this case "0.01"
    ", " matches exactly
    "([.0-9]+)" matches any number consisting of the digits 0-9 and ".", in this case "0.02"
    ", " matches exactly
    "([.0-9]+)" matches any number consisting of the digits 0-9 and ".", in this case "0.03"

    So this rule matches, and the following part of the map file defines how to name the database and its columns:

    and push @s, [ load,
                   [ avg1min,  GAUGE, $1 ],
                   [ avg5min,  GAUGE, $2 ],
                   [ avg15min, GAUGE, $3 ] ];
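The effect of this rule can be mimicked in Python to experiment with the regular expression outside of NagiosGraph. The real map file is Perl, so this is only a sketch of the same idea; the "output:" prefix of the test string is an assumption based on the comment in the map file, and the function name match_load is made up.

```python
import re

# Same pattern as in the map file rule, written as a Python regular expression.
LOAD_RULE = re.compile(r"output:.*load average: ([.0-9]+), ([.0-9]+), ([.0-9]+)")
COLUMNS = ("avg1min", "avg5min", "avg15min")


def match_load(text: str):
    """Return (database, columns) if the load rule matches, else None."""
    match = LOAD_RULE.search(text)
    if match is None:
        return None
    columns = [
        (name, "GAUGE", float(value))
        for name, value in zip(COLUMNS, match.groups())
    ]
    return ("load", columns)


print(match_load("output:OK - load average: 0.01, 0.02, 0.03"))
print(match_load("output:PING OK - Packet loss = 0%"))
```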

    "load" is the database name, "avg1min", "avg5min" and "avg15min" are the columns.

    "$1" till "$3" reference the decimal value in the string, so "$1" refers to "0.01", "$2" to "0.02" and so on.
    "GAUGE" defines the type of data it is and means write exact value to the database.

    There are a few other types available, like "COUNTER" and "DERIVE". Imagine you check a service that reports a counter that always increases. You do not want to write the counter itself to the database; you want to write the difference from the last value. This is an example where you need one of the other types. A description of them can be found here: http://oss.oetiker.ch/rrdtool/doc/rrdcreate.en.html
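The idea behind a counter type can be sketched as follows: instead of the raw counter, the change between two consecutive readings is stored, divided by the time between them. This Python sketch is a simplification for illustration; it ignores details such as counter wrap-around that RRDtool deals with, and the sample values are invented.

```python
def per_second_rates(samples):
    """Turn (timestamp, counter) readings into per-second rates.

    samples must be sorted by timestamp; timestamps are in seconds.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Hypothetical counter readings taken every 5 minutes (300 seconds).
samples = [(0, 1000), (300, 1600), (600, 2500)]
print(per_second_rates(samples))
```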

    It is also possible to do some arithmetic operations when writing the values to the database. You can find some samples in the map file itself. This is helpful for converting kBytes or MBytes into bytes. show.cgi scales the values in the diagram automatically, so you will get the kBytes or MBytes back. Imagine you write values like 1560 kBytes into the database without converting them to bytes. The chart will then show the value as 1.56k, but as the initial value was already in kBytes it should be 1.56 MBytes. Writing the value in bytes into the database results in a diagram that shows 1.56M.
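The scaling effect can be reproduced with a few lines of Python. The helper si_format below is made up for this example and only imitates the automatic scaling in the chart; it assumes a factor of 1000 per prefix, which matches the 1.56k and 1.56M figures in this article.

```python
def si_format(value: float) -> str:
    """Format a number with an SI prefix, similar to chart axis labels."""
    for prefix in ("", "k", "M", "G", "T"):
        if abs(value) < 1000:
            return f"{value:.2f}{prefix}"
        value /= 1000
    return f"{value:.2f}P"


kbytes = 1560  # value as reported by the check, in kBytes

# Stored as-is, the chart label is misleading.
print(si_format(kbytes))
# Converted to bytes before storing (factor 1000 assumed here).
print(si_format(kbytes * 1000))
```

If your check reports binary kilobytes, the conversion factor would be 1024 instead, and the label changes accordingly.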

    So I think that is enough for the moment. I hope you got enough information to understand how it works, so that you can try these steps for other services on your own.

    The last part here will give you just a few additional hints for NagiosGraph:

  10. Special Hints

    If you have problems getting your graphs working, it might be helpful to increase the debug level in /opt/nagiosgraph/nagiosgraph.conf. The output will be written to /var/opt/nagiosgraph/nagiosgraph.log. Don't forget to set the debug level back once you have found the problem; otherwise the file will grow very large.

    If you think the problem might be Apache related, take a look at its log files in /var/log/apache2.

    The round robin database is created when the first value is written to it. If you decide later to store another value from the service output that you had not defined before, that will not work. Because the database is created with exactly the definition from the first values, you cannot add more columns on the fly. In this case you have to delete the database file so that it gets recreated the next time. All data collected for that service up to that point is lost.

    NagiosGraph is not usable for SLA charts. Because data gets consolidated when values are written into the daily, weekly and yearly charts, you will lose absolute values and get average values.

    Try to define service or performance outputs as detailed as possible so that only one regular expression matches. Otherwise the data might be written to a different database.

    If you have a syntax error in the map file, it can happen that no data at all is written to any of the database files until you correct the problem. Take care of that and check the modification date of the database files after you have made larger changes to the map file. If the modification date keeps changing, data is still being written to them.

    If you define a chart title in the notes_url that contains more than one word (e.g. CPU Load), no spaces are allowed between the words. The rrdopts in the background would interpret the word after the space (Load) as the next parameter. So use dashes to join the words (CPU-Load; as dashes have to be translated to %2D, this results in CPU%2DLoad).
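Checking the modification dates by hand gets tedious with many services. The following Python sketch lists all .rrd files below a directory that have not been modified within a given time. It is a helper idea, not something that ships with NagiosGraph; the function name stale_rrd_files and the 15-minute threshold in the usage note are assumptions.

```python
import time
from pathlib import Path


def stale_rrd_files(rrd_dir, max_age_seconds, now=None):
    """Return .rrd files under rrd_dir not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(rrd_dir).rglob("*.rrd")):
        # st_mtime is the last modification time in seconds since the epoch.
        if now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return stale
```

Calling stale_rrd_files("/var/opt/nagiosgraph/rrd", 15 * 60) a while after a map file change would return the databases that have not received data in the last 15 minutes. Choose a threshold that is comfortably larger than your check interval.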

Rainer Brunold

