On an NCS cluster that operates without NSS and SBD, nodes do not fence after losing network connectivity.

  • 7015426
  • 25-Jul-2014
  • 20-Oct-2014

Environment

Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 1
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 2
Novell Cluster Services

Situation

It is possible to operate Novell Cluster Services without Novell Storage Services (NSS) and without an SBD partition. This can be used, for example, to provide high availability in a fail-over cluster for services that do not require their data to be stored on shared storage.

In this particular setup, OES11 SP2 servers were installed without NSS and configured with Novell Cluster Services (NCS operating without an SBD partition) and a cluster resource that provides LDAP high availability.

It was observed that when either of the cluster nodes lost its network connection, the resources residing on that node did not fail over to the surviving nodes in the cluster, as would be expected with NCS clustering.

Resolution

The problem is resolved by a newer version of the Novell Cluster Services modules, released with the OES11 SP2 September 2014 Scheduled Maintenance Update.

Cause

There was a problem in the GIPC logic when the cluster operated without an SBD partition.

Additional Information

Testing this in a straightforward two-node cluster setup showed that when the cluster node holding the Master_IP_Address resource was disconnected from the network, NCS correctly reported that GIPC was down, and the Master_IP_Address resource correctly loaded on the surviving node. However, the node with the network problem never fenced itself out of the cluster and remained up and running.

Effectively, this led to the undesirable situation where two different nodes hosted the Master_IP_Address resource (or, for that matter, any other NCS resource) at the same time, which could potentially cause data corruption once the network connection was restored.
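To make the consequence of the missing fence concrete, the following is a toy Python model of the two-node test. It is not Novell's code and does not reflect how NCS or GIPC are implemented. The Node class and the handle_partition and simulate functions are hypothetical names used only for illustration. In the model, the node holding Master_IP_Address loses its network connection. With self-fencing, only the surviving node ends up hosting the resource. Without self-fencing, which matches the behavior observed in this article, both nodes claim it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node: not an NCS data structure."""
    name: str
    network_up: bool = True
    running: bool = True
    resources: set = field(default_factory=set)


def handle_partition(nodes, resource, self_fence_enabled):
    """Toy reaction of a cluster to one node losing its network link.

    Illustrative logic only; it does not model GIPC or SBD internals.
    Returns the names of running nodes that host `resource`.
    """
    survivors = [n for n in nodes if n.network_up and n.running]
    isolated = [n for n in nodes if not n.network_up and n.running]

    # The connected side sees the isolated node as gone and loads the resource.
    if survivors and not any(resource in n.resources for n in survivors):
        survivors[0].resources.add(resource)

    # Expected behavior: the isolated node leaves the cluster and drops its resources.
    if self_fence_enabled:
        for n in isolated:
            n.running = False
            n.resources.clear()

    return [n.name for n in nodes if n.running and resource in n.resources]


def simulate(self_fence_enabled):
    """Two-node scenario: node1 holds Master_IP_Address and loses its network."""
    node1 = Node("node1", resources={"Master_IP_Address"})
    node2 = Node("node2")
    node1.network_up = False
    return handle_partition([node1, node2], "Master_IP_Address", self_fence_enabled)


print(simulate(True))   # ['node2']
print(simulate(False))  # ['node1', 'node2']
```

The second result is the split-brain condition: two running nodes both believe they own the same resource, which is why the isolated node is expected to fence itself.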