Novell Cluster Services is unable to detect a network link down when installed on a Xen domain 0 (dom0)

  • 7004595
  • 05-Oct-2009
  • 27-Apr-2012

Environment


SUSE Linux Enterprise Server 10 Service Pack 2
Novell Open Enterprise Server 2 (OES 2) Linux
Novell Cluster Services
Xen

Situation

When Novell Cluster Services loads, it uses the default configured Ethernet device, for example eth0. By default, the Xen network-bridge script renames that physical device to peth0 and then adds a virtual device named eth0 to the Xen bridge.

Should a network failure occur or be simulated, the loss of connectivity is reported for peth0, not for eth0. Because the virtual eth0 device on the Xen bridge is never reported as being down, Novell Cluster Services remains unaware of the network outage.
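
The mismatch described above can be observed directly on a dom0. The following is a minimal sketch, assuming Python 3 is available and that the kernel exposes link state in /sys/class/net/<device>/carrier (1 for link up, 0 for link down). The function names link_state and compare_devices are hypothetical and are not part of Novell Cluster Services or Xen.

```python
import os


def link_state(device, sysfs_root="/sys/class/net"):
    """Return 'up', 'down', or 'unknown' for a network device.

    Assumption: link state is readable from <sysfs_root>/<device>/carrier.
    """
    carrier_path = os.path.join(sysfs_root, device, "carrier")
    try:
        with open(carrier_path) as handle:
            value = handle.read().strip()
    except (IOError, OSError):
        # The device does not exist, or the kernel refuses the read
        # (for example, the interface is administratively down).
        return "unknown"
    if value == "1":
        return "up"
    if value == "0":
        return "down"
    return "unknown"


def compare_devices(devices=("eth0", "peth0"), sysfs_root="/sys/class/net"):
    """Map each device name to its reported link state."""
    return dict((dev, link_state(dev, sysfs_root)) for dev in devices)


if __name__ == "__main__":
    for dev, state in sorted(compare_devices().items()):
        print("%s: %s" % (dev, state))
```

With the LAN cable removed, a dom0 affected by this issue would be expected to show peth0 as down while eth0 still shows as up.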

Resolution

There are two possible solutions:

A. Follow https://support.microfocus.com/kb/doc.php?id=7000616, which provides an overview and examples of how to write a new network-bridge script, for use by the Xen bridge, that prevents device renaming.

B. Use two network interface cards: one for Novell Cluster Services and a second one for use by the Xen DomUs with default bridging. To specify the default device to use for the Xen bridge, do the following:
  1. Edit "/etc/xen/xend-config.sxp".
  2. Look for the line that executes the network-bridge script, which will look something like "(network-script network-bridge)".
  3. Modify that line to read "(network-script 'network-bridge netdev=eth1')", where "eth1" is the device to use as the default Xen bridge device.
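
The edit in steps 1 to 3 can also be scripted. The following is a minimal sketch in Python 3 that works on the text of xend-config.sxp and leaves reading and writing the file to the caller. The function name set_bridge_netdev is hypothetical, and the pattern assumes the active entry sits on a single uncommented line in one of the two forms shown in the steps above.

```python
import re

# Matches an uncommented "(network-script network-bridge ...)" line,
# with or without quotes and existing arguments.
_NETWORK_SCRIPT = re.compile(
    r"^([ \t]*)\(network-script[ \t]+'?network-bridge(?![-\w])[^)'\n]*'?\)[ \t]*$",
    re.MULTILINE,
)


def set_bridge_netdev(config_text, netdev):
    """Return config_text with the network-bridge entry pinned to netdev."""

    def _replace(match):
        # Keep the original indentation, rewrite the rest of the line.
        return "%s(network-script 'network-bridge netdev=%s')" % (
            match.group(1),
            netdev,
        )

    new_text, count = _NETWORK_SCRIPT.subn(_replace, config_text)
    if count == 0:
        raise ValueError("no active (network-script network-bridge) line found")
    return new_text
```

A caller would read /etc/xen/xend-config.sxp, pass its contents to set_bridge_netdev together with the device name (for example "eth1"), and write the result back after taking a backup copy. Commented-out entries are left untouched.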

Additional Information

To understand how this negatively impacts Novell Cluster Services, consider the following:

First, some background on the Master IP Address resource: Novell Cluster Services uses it for management purposes, but also in order to determine node survival in certain scenarios when a perceived node failure occurs. In a two-node cluster, if each node believes the other has failed (for example, because it no longer receives network heartbeat packets), the node hosting the Master IP Address resource will survive and force the other node to restart.

Taking all of the above into account, below is an example of what can then happen in a two-node cluster:
  1. Node1 and Node2 are two Xen Dom0s, each with one Ethernet device, eth0, and both are members of the same Novell cluster.
  2. When Novell Cluster Services loads, it will use the default device, in this case eth0.
  3. When the Xen network-bridge script is executed, eth0 is renamed to peth0 and eth0 is added as a virtual device on the Xen bridge.
  4. Executing "cluster status" from the server console, we see the Master IP Address resource is running on Node1.
  5. We now simulate a network failure on Node1 by removing the LAN cable.
  6. /var/log/messages reports something like "kernel: <nic_driver>: peth0 NIC Copper Link is Down".
  7. Immediately after this, we would normally expect to see another message from Novell Cluster Services stating "gipc link down" (gipc is the Novell Cluster Services module that monitors network state changes). However, this never happens, because gipc is monitoring eth0.
  8. Node1 is hosting the Master IP Address resource, so Node2 is forced out of the cluster and restarts, even though Node2 is the node that still has good network connectivity.