7.8 Preventing Cascading Failovers

A cascading failover occurs when a bad cluster resource causes a server to fail, fails over to another server and causes that server to fail, and then continues failing over to and bringing down additional servers until possibly every server in the cluster has failed.

Novell Cluster Services now detects whether a node has failed because of a bad cluster resource and prevents that resource from failing over to other servers in the cluster.

This functionality is enabled by default when you install Novell Cluster Services. Cascading failover prevention can be disabled by adding the /hmo=off parameter to the clstrlib command in the sys:\system\ldncs.ncf file.

After adding the parameter, the line should appear as follows:

clstrlib /hmo=off

If you disable cascading failover prevention on one cluster server, you must do it on all servers in the cluster.

You must manually unload and reload Novell Cluster Services software on every cluster server in order for this change to take effect. To do this, use the uldncs command to unload cluster software and the ldncs command to reload cluster software.

Resource Quarantine

If cascading failover prevention is enabled, a resource might be put into quarantine if it causes server abends within a three-day period. If Novell Cluster Services software determines that the resource is likely responsible for the abends, and that loading the resource would put the cluster in grave danger, it places the resource in a comatose state (quarantines it) rather than letting it load on, and potentially bring down, other cluster nodes.

The resource can still be manually brought online and manually migrated to other cluster nodes. To get the resource out of quarantine, you can disable cascading failover prevention. Cascading failover prevention can then be re-enabled by removing the clstrlib /hmo=off line from the sys:\system\ldncs.ncf file, then unloading and reloading Novell Cluster Services software.

Novell Cluster Services does the following to determine if a resource should be put into quarantine:

  1. Traces back the history of node failures for the suspected bad resource. This includes:

    • Which node the resource was running or loading on.

    • Whether the node failed.

    • The state the resource was in when the node failed.

    • Whether other resources were trying to load when the node failed.

  2. Repeats the above process until one of the following happens:

    • The end of the cluster log file is reached.

    • Enough node failures are found.

    • The node is found not to have failed.

    • The entries in the log file are more than three days old.

If the resource attempts to load on a node where it was previously loaded and there are additional nodes still available in the cluster, it will not be quarantined and will be allowed to load. Also, a resource is not quarantined when it is initially brought online.
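
The trace-back loop in steps 1 and 2 can be illustrated with a minimal Python sketch. This is not the Novell Cluster Services implementation. The LogEntry record, its field names, and the threshold parameter are hypothetical: the text does not describe the cluster log format or say how many node failures count as "enough". Only the three-day window comes from the text. The sketch covers the trace-back alone, not the two exceptions in the preceding paragraph.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the text: log entries more than three days old end the trace-back.
WINDOW = timedelta(days=3)


@dataclass
class LogEntry:
    """Hypothetical record of one load attempt (not the real log format)."""
    when: datetime        # time of the load attempt
    resource: str         # resource that was running or loading
    node: str             # node it was running or loading on
    node_failed: bool     # whether that node failed


def count_node_failures(entries, resource, now, threshold):
    """Walk the log newest-first and count node failures tied to `resource`."""
    failures = 0
    for entry in sorted(entries, key=lambda e: e.when, reverse=True):
        if entry.resource != resource:
            continue                  # unrelated resource, keep tracing
        if now - entry.when > WINDOW:
            break                     # entry is more than three days old
        if not entry.node_failed:
            break                     # the node did not fail
        failures += 1
        if failures >= threshold:
            break                     # enough node failures found
    return failures                   # the loop also ends at the end of the log


def should_quarantine(entries, resource, now, threshold=2):
    """`threshold` is an assumed value; the text does not define 'enough'."""
    return count_node_failures(entries, resource, now, threshold) >= threshold
```

Each break in the loop corresponds to one of the stop conditions listed in step 2. Running out of entries corresponds to reaching the end of the cluster log file.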

Factors that might contribute to a resource being quarantined include:

Factors that might help prevent a resource from being quarantined include:

Resource quarantine is disabled if