Dead IP Address resource fails to restart or migrate

This document (7012073) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11
SUSE Linux Enterprise High Availability Extension 11

Situation

An IPaddr or IPaddr2 OCF resource was not restarted or migrated after its IP address dropped and stopped working.

The output of cibadmin -Q shows the CIB entry for the resource as:


<primitive class="ocf" id="db_ip" provider="heartbeat" type="IPaddr2">
  <instance_attributes id="db_ip-instance_attributes">
    <nvpair id="db_ip-instance_attributes-ip" name="ip" value="150.150.150.150"/>
    <nvpair id="db_ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
    <nvpair id="db_ip-instance_attributes-nic" name="nic" value="eth0"/>
  </instance_attributes>
  <operations/>
</primitive>

Resolution

Create a monitor operation for the resource to monitor its health. Generally, every resource should have a monitor operation. For the example above, a monitor operation with a 10-second interval and a 20-second timeout can be created with the following command:

crm configure monitor db_ip 10s:20s

# crm configure monitor
usage: monitor <rsc>[:<role>] <interval>[:<timeout>]

After the monitor operation is added, the cibadmin -Q output looks like this:

<primitive class="ocf" id="db_ip" provider="heartbeat" type="IPaddr2">
  <instance_attributes id="db_ip-instance_attributes">
    <nvpair id="db_ip-instance_attributes-ip" name="ip" value="150.150.150.150"/>
    <nvpair id="db_ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
    <nvpair id="db_ip-instance_attributes-nic" name="nic" value="eth0"/>
  </instance_attributes>
  <operations id="db_ip-operations">
    <op id="db_ip-op-monitor-10" interval="10" name="monitor" timeout="20s"/>
  </operations>
</primitive>
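
To find other resources that are missing a monitor operation, the cibadmin -Q output can be scanned for primitives with no monitor entry. The following is a minimal sketch in Python, not a SUSE-provided tool. It assumes the CIB XML has been saved or captured as a string; the function name primitives_without_monitor and the second resource id web_ip are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# Sample fragment modeled on the cibadmin -Q output in this document.
# "web_ip" is a hypothetical second resource added for comparison.
CIB_XML = """
<resources>
  <primitive class="ocf" id="db_ip" provider="heartbeat" type="IPaddr2">
    <operations/>
  </primitive>
  <primitive class="ocf" id="web_ip" provider="heartbeat" type="IPaddr2">
    <operations id="web_ip-operations">
      <op id="web_ip-op-monitor-10" interval="10" name="monitor" timeout="20s"/>
    </operations>
  </primitive>
</resources>
"""


def primitives_without_monitor(cib_xml):
    """Return the ids of primitives that define no op named 'monitor'."""
    root = ET.fromstring(cib_xml)
    missing = []
    for prim in root.iter("primitive"):
        # Look at every <op> element nested under this primitive.
        has_monitor = any(op.get("name") == "monitor" for op in prim.iter("op"))
        if not has_monitor:
            missing.append(prim.get("id"))
    return missing


if __name__ == "__main__":
    print(primitives_without_monitor(CIB_XML))  # prints ['db_ip']
```

In practice the string would come from the real cibadmin -Q output rather than an embedded sample. Each id that is reported is a candidate for the crm configure monitor command shown above.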

Cause

The resource monitor operation had been removed from the IP Address primitive.

Additional Information

SUSE Linux Enterprise High Availability Extension (HAE) allows administrators to continually monitor the health and status of their resources, manage dependencies, and automatically stop and start services based on highly configurable rules and policies.

All OCF (Open Cluster Framework) Resource Agents are required to support at least the following actions: start, stop, status, monitor, and meta-data.

Unless told otherwise, the cluster does not verify that resources remain healthy. The cluster administrator must add a monitor operation to each resource that the cluster should watch. Monitor operations can be added for all classes of resource agents.

Without a monitor operation, the cluster does not check whether a resource is still running correctly once it has been started.

For example, if an IP Address primitive has no monitor operation, the address could suddenly become unavailable (for whatever reason), and the cluster would neither know about it nor take any action to resolve it. This could severely affect any applications that depend on that address; hours might pass before the problem is recognized and resolved.

In the same scenario, if a monitor operation is assigned to the IP Address primitive, the cluster recognizes a problem with the address and tries to resolve it. This could mean stopping the resource and restarting it on another available node, fencing the current node, or both. This is what high availability is designed to do: automatically recognize failures and try to resolve them so that managed resources remain available to the user base.

Customers who choose to remove monitor operations from one or more primitives should understand the risks of doing so:

(1) Clustering will not detect any failures of this resource.
(2) Clustering will not automatically try to resolve any failures of this resource.
(3) Any application dependent upon this resource may be unavailable.
(4) The administrator or user-base must detect any resource failures.
(5) The administrator must manually resolve any resource failures.

As long as these risks are understood, customers are welcome to remove monitor operations from primitives in their cluster.

Some customers remove monitor operations from their primitives because they believe the cluster is fencing nodes more often than it should (false-positive fences) and that removing the monitors will increase cluster and resource uptime. Rather than removing the monitor operations, we suggest adjusting their timeouts to suit the infrastructure in which the cluster is deployed.
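
Before adjusting timeouts, it can help to list the interval and timeout currently set on each monitor operation. The following Python sketch is one possible way to do that from saved cibadmin -Q output; it is an illustration, not a SUSE-provided tool. The function names to_seconds and monitor_timings are hypothetical, and the unit handling is a simplification that assumes values are written as bare seconds or with an ms, s, min, or h suffix.

```python
import xml.etree.ElementTree as ET

# Simplified unit table (assumption): only ms, s, min, and h suffixes are handled.
UNITS = {"ms": 0.001, "s": 1, "min": 60, "h": 3600}

# Sample fragment modeled on the cibadmin -Q output in this document.
SAMPLE = """
<primitive class="ocf" id="db_ip" provider="heartbeat" type="IPaddr2">
  <operations id="db_ip-operations">
    <op id="db_ip-op-monitor-10" interval="10" name="monitor" timeout="20s"/>
  </operations>
</primitive>
"""


def to_seconds(value):
    """Convert strings such as '10', '20s', or '2min' to seconds."""
    text = value.strip()
    # Check "ms" before "s" so that "500ms" is not read as seconds.
    for suffix in ("ms", "min", "s", "h"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    return float(text)  # a bare number is treated as seconds


def monitor_timings(cib_xml):
    """Return (resource id, interval, timeout) in seconds for each monitor op."""
    root = ET.fromstring(cib_xml)
    rows = []
    for prim in root.iter("primitive"):
        for op in prim.iter("op"):
            if op.get("name") != "monitor":
                continue
            timeout = op.get("timeout")
            rows.append((
                prim.get("id"),
                to_seconds(op.get("interval", "0")),
                # None means no explicit timeout is set on this operation.
                to_seconds(timeout) if timeout is not None else None,
            ))
    return rows


if __name__ == "__main__":
    for rsc, interval, timeout in monitor_timings(SAMPLE):
        print(f"{rsc}: interval={interval}s timeout={timeout}s")
```

A timeout that is short relative to how long the resource agent takes to respond under load is a likely candidate for adjustment. The appropriate values depend on the environment and are not prescribed here.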

For resource monitor adjustments and/or deployment, please see the SLES 11 HAE Documentation found here:

https://www.suse.com/documentation/sle_ha/

or

https://www.suse.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/book_sleha.html

Chapter 4, Section 3 discusses resource monitoring and links to documentation on how to implement monitors in Hawk, hb_gui, or the crm shell interface.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

Document ID: 7012073
Creation Date: 03-Apr-2013
Modified Date: 03-Mar-2020
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com