Novell Doc: OES 2 SP3: Novell Cluster Services 1.8.8 Administration Guide for Linux - Enabling Monitoring and Configuring the Monitor Script

10.6 Enabling Monitoring and Configuring the Monitor Script

Resource monitoring allows Novell Cluster Services to detect when an individual resource on a node has failed, independently of its ability to detect node failures. Monitoring is disabled by default. It is enabled separately for each cluster resource.

10.6.1 Understanding Resource Monitoring

When you enable resource monitoring, you must specify a polling interval, a failure rate, a failure action, and a timeout value. These settings control how error conditions are resolved for the resource.

Polling Interval

The monitoring script runs at a frequency specified by the polling interval. By default, it runs every minute when the resource is online. You can specify the polling interval in minutes or seconds. The polling interval applies only to a given resource.

Failure Rate

The failure rate is the maximum number of failures (Maximum Local Failures) detected by the monitoring script during a specified amount of time (Time Interval).

A failure action is initiated when the resource monitor detects that the resource has failed more times than the maximum number of local failures allowed during the specified time interval. For each failure up to that maximum, Cluster Services automatically attempts to unload and load the resource. The progress and output of the monitor script are appended to the /var/opt/novell/log/ncs/resource_name.monitor.out file.

For example, if you set the failure rate to 3 failures in 10 minutes, the failure action is initiated if the resource fails 4 times in a 10-minute period. For the first 3 failures, Cluster Services automatically attempts to unload and load the resource.
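The arithmetic in the example above can be sketched as a small shell function. This is an illustration only, not the Cluster Services implementation: the name failure_response and its arguments are hypothetical, and the sketch assumes that failures are counted in a sliding window that ends at the most recent failure.

```shell
# Illustration only (hypothetical helper, not part of Cluster Services).
# Usage: failure_response MAX WINDOW_SECONDS TIMESTAMP...
# Timestamps are failure times in seconds, oldest first; the last one is the
# failure being evaluated.
failure_response() (
    max=$1
    window=$2
    shift 2
    # The last timestamp is the time of the current failure.
    for t in "$@"; do now=$t; done
    count=0
    for t in "$@"; do
        # Count failures inside the window that ends at the current failure.
        if [ $((now - t)) -lt "$window" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -gt "$max" ]; then
        echo "failure_action"
    else
        echo "unload_and_load"
    fi
)

# Failure rate of 3 failures in 10 minutes (600 seconds):
failure_response 3 600 0 120 240        # prints unload_and_load
failure_response 3 600 0 120 240 360    # prints failure_action
failure_response 3 600 0 120 240 900    # prints unload_and_load
```

In the last call, the three earlier failures are more than 600 seconds older than the current one, so only the current failure counts toward the maximum.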

Failure Action

The Failover Action indicates whether you want the resource to be set to a comatose state, to migrate to another server, or to reboot the hosting node (without synchronizing or unmounting the disks) if a failure action initiates. The reboot option is normally used only for a mission-critical cluster resource that must remain available.

If the failure action initiates and you chose the option to migrate the resource to another server, the resource migrates to the next server in its Assigned Nodes list, which you previously ordered according to your preferences. The resource remains on the server it has migrated to unless you migrate it to another server or the failure action initiates again, in which case it again migrates to the next server in its Assigned Nodes list.

If the failure action initiates and you chose the option to reboot the hosting node without synchronizing or unmounting the disks, each of the resources on the hosting node will fail over to the next server in its Assigned Nodes list because of the reboot. This is a hard reboot, not a graceful one.

With resource monitoring, the Start, Failover, and Failback Modes have no effect on where the resource migrates. This means that a resource that has been migrated by the resource monitoring failure action does not migrate back (fail back) to the node it migrated from unless you manually migrate it back.

Timeout Value

The timeout value determines how much time the script is given to complete. If the script does not complete within the specified time, the configured failure action is initiated. Cluster Services marks the process as failed right after the defined timeout expires, but it must wait for the process to conclude before it can start other resource operations.

The timeout value is applied only when the resource is migrated to another node. It is not used during resource online/offline procedures.

How Resource Monitoring Works

1. The monitoring script runs at the frequency you specify as the polling interval.
2. There are two conditions that trigger a response by Novell Cluster Services:
   - An error is returned. Go to Step 3.
   - The script times out, and the process fails. Go to Step 4.
3. Novell Cluster Services tallies the error occurrence, compares it to the configured failure rate, then does one of the following:
   - Total errors in the interval are less than or equal to the Maximum Local Failures: Novell Cluster Services tries to resolve the error by offlining the resource, then onlining the resource.

     If this problem resolution effort fails, Novell Cluster Services goes to Step 4 immediately regardless of the failure rate condition at that time.
   - Total errors in the interval are more than the Maximum Local Failures: Go to Step 4.
4. Novell Cluster Services initiates the configured failure action. Possible actions are:
   - Puts the resource in a comatose state
   - Migrates the resource to another server
   - Reboots the hosting node (without synchronizing or unmounting the disks)
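The flow described above can be summarized as a decision function. The following is a minimal sketch under stated assumptions, not Cluster Services code: the name monitor_response, its arguments, and the result strings are all hypothetical, and the function models only the final outcome of one monitor run.

```shell
# Illustration only (hypothetical helper, not part of Cluster Services).
# Usage: monitor_response RESULT ERRORS_IN_INTERVAL MAX_LOCAL_FAILURES RESTART_OK
#   RESULT      ok | error | timeout   (outcome of one monitor script run)
#   RESTART_OK  yes | no               (does offlining, then onlining succeed?)
monitor_response() (
    result=$1
    errors=$2
    max=$3
    restart_ok=$4
    case $result in
        ok)
            echo "none" ;;
        timeout)
            # A timeout leads directly to the configured failure action.
            echo "failure_action" ;;
        error)
            # Compare the error tally with the configured failure rate.
            if [ "$errors" -gt "$max" ]; then
                echo "failure_action"
            elif [ "$restart_ok" = "yes" ]; then
                echo "offline_then_online"
            else
                # A failed recovery attempt also leads to the failure action,
                # regardless of the failure rate.
                echo "failure_action"
            fi ;;
        *)
            echo "unknown result: $result" >&2
            exit 2 ;;   # exits only this function's subshell
    esac
)

monitor_response error 3 3 yes    # prints offline_then_online
monitor_response error 4 3 yes    # prints failure_action
monitor_response timeout 0 3 yes  # prints failure_action
```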

10.6.2 Configuring Resource Monitoring

The resource monitoring function allows you to monitor the health of a specified resource by using a script that you create or customize. If you want Novell Cluster Services to check the health status of a resource, you must enable and configure resource monitoring for that resource. Enabling resource monitoring requires you to specify a polling interval, a failure rate, a failure action, and a timeout value.

If you are creating a new cluster resource, the Monitor Script page should already be displayed. You can start with Step 5.

1. In iManager, click Clusters, then click Cluster Options.
2. Browse to locate and select the Cluster object of the cluster you want to manage.
3. Select the check box next to the resource that you want to configure monitoring for, then click the Details link.
4. Click the Monitoring tab.
5. Select the Enable Resource Monitoring check box to enable resource monitoring for the selected resource.

   Resource monitoring is disabled by default.
6. For the polling interval, specify how often you want the resource monitoring script for this resource to run.

   You can specify the value in minutes or seconds.
7. Specify the number of failures (Maximum Local Failures) for the specified amount of time (Time Interval).

   For information, see Failure Rate.
8. Specify the Failover Action by indicating whether you want the resource to be set to a comatose state, to migrate to another server, or to reboot the hosting node (without synchronizing or unmounting the disks) if a failure action initiates. The reboot option is normally used only for a mission-critical cluster resource that must remain available.

   For information, see Failure Action.
9. Click the Scripts tab, then click the Monitor Script link.
10. Edit or add the necessary commands to the script to monitor the resource on the server.

    The resource templates included with Novell Cluster Services for Linux include resource monitoring scripts that you can customize.

    You can use the same commands that would be used at the Linux terminal console. For example, see Section 10.6.4, Monitoring Services That Are Critical to Clustering.
11. Specify the Monitor Script Timeout value, then click Apply to save the script.

    The timeout value determines how much time the script is given to complete. If the script does not complete within the specified time, the failure action you chose in Step 8 initiates.
12. Do one of the following:
    - If you are configuring a new resource, click Next, then continue with Section 10.7.2, Setting the Start, Failover, and Failback Modes for a Resource.
    - Click Apply to save your changes.

      Changes for a resource’s properties are not applied while the resource is loaded or running on a server. You must offline, then online the resource to activate the changes for the resource.
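The offline-then-online cycle needed to activate the changes can also be performed from a terminal on a cluster node. The following is a sketch that assumes the Cluster Services cluster console command with its offline and online subcommands; the helper name reload_resource, the CLUSTER_CMD variable, and the resource name MYPOOL_SERVER are hypothetical. The resource is unavailable while it is offline, so plan accordingly.

```shell
# Hypothetical helper. CLUSTER_CMD defaults to the Cluster Services "cluster"
# console command and can be overridden (for example, with "echo") for a dry run.
CLUSTER_CMD=${CLUSTER_CMD:-cluster}

reload_resource() {
    # Take the resource offline, then bring it back online so that the
    # changed monitoring settings take effect.
    "$CLUSTER_CMD" offline "$1" && "$CLUSTER_CMD" online "$1"
}

# Example (run as root on a cluster node; MYPOOL_SERVER is a placeholder):
# reload_resource MYPOOL_SERVER
```

It is worth confirming that the resource has actually reached the offline state before it is brought online again; the sketch only checks the exit status of the first command.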

10.6.3 Example Monitoring Scripts

The resource templates included with Novell Cluster Services for Linux include resource monitoring scripts that you can customize.

Example monitor scripts are available in the following sections:

10.6.4 Monitoring Services That Are Critical to Clustering

Monitoring scripts can also be used for monitoring critical services needed by the resources, such as Linux User Management (namcd) and Novell eDirectory (ndsd). However, the monitoring is in effect only where the cluster resource is running.

IMPORTANT: The monitor script runs only on the cluster server where the cluster resource is currently online. The script does not monitor the critical services on its assigned cluster server when the resource is offline. The monitor script does not monitor critical services for any other cluster node.

For example, to monitor whether the namcd and ndsd services are running, add the following commands to the Monitor script:

# (optional) status of the eDirectory service
exit_on_error rcndsd status 

# (optional) status of the Linux User Management service 
exit_on_error rcnamcd status

You can use the namcd status command instead of rcnamcd status in the Monitor script if you want to automatically restart namcd if it is not loaded and running. However, namcd creates messages in /var/log/messages with each check.
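For orientation, the following sketch shows how checks of this kind behave inside a script. In a real monitor script, exit_on_error is supplied by Cluster Services; the stand-in definition below is an assumption about its behavior, written only so that the sketch runs on its own, and check_services is a hypothetical name.

```shell
# Stand-in for the helper that Cluster Services provides to its scripts
# (assumed behavior: run the command, and exit nonzero if it fails).
exit_on_error() {
    "$@" || exit 1
}

# Hypothetical wrapper: each argument is one status command to run. The
# subshell body keeps "exit" from ending the calling shell in this sketch.
check_services() (
    for cmd in "$@"; do
        # Left unquoted on purpose so that "rcndsd status" splits into words.
        exit_on_error $cmd
    done
    exit 0
)

# In a monitor script, the checks would be the commands shown above, e.g.:
# check_services "rcndsd status" "rcnamcd status"
check_services "true" "test -d /" && echo "all checks passed"
check_services "true" "false" "true" || echo "monitor reports failure"
```

The first failing check ends the script with a nonzero exit status, which is what the resource monitor treats as a returned error; later checks are not run.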