Server Abends

One of the most important issues you might need to troubleshoot in Novell® Cluster Services™ is intentional abends. An intentional abend is functionality built into Novell Cluster Services to isolate a node from the cluster. Before discussing how to troubleshoot intentional abends, it helps to understand why this functionality is included with Novell Cluster Services.

A split-brain condition is a condition in which a single node or a group of nodes becomes isolated from the other nodes in the cluster. Consider, for example, a case where three nodes in a cluster are connected to one switch and another six nodes in the same cluster are connected to a different switch. If the cross-connect between the two switches were to fail, a split-brain condition would then exist. The three nodes connected to one switch would believe that the other six nodes failed, and the six nodes would believe that the three-node group had failed.

If this condition were allowed to continue, the group of three nodes would start activating and mounting the resources that were currently running on the group of six nodes. Meanwhile, the group of six would start activating and mounting the resources that were currently running on the group of three.

Novell Cluster Services includes a distributed lock for the Split Brain Detection partition used by the clustering software, but it does not support a lock across cluster nodes for user data. If two servers were allowed to write to the same volume at the same time, there would be no way to prevent file system corruption.

To prevent file system corruption, Novell Cluster Services uses the Cluster Services partition that is created on the shared storage system during the installation. Each node is assigned its own disk space on this partition, and each node performs a periodic write to its own area. In addition to writing to its own area, each node can also read the information on the disk space areas of all the other nodes. If a cluster node can no longer access the Cluster Services partition, it removes itself from the cluster.
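The heartbeat scheme described above can be sketched as a small simulation. This is a toy model under stated assumptions, not Novell's implementation: the `SharedPartition` and `Node` classes, the per-node counter, and the `PartitionUnavailable` exception are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


class PartitionUnavailable(Exception):
    """Raised when a node cannot reach the shared partition (hypothetical)."""


@dataclass
class SharedPartition:
    """Toy stand-in for the Cluster Services partition: one slot per node."""
    slots: dict = field(default_factory=dict)        # node name -> heartbeat counter
    unreachable: set = field(default_factory=set)    # nodes that lost storage access

    def write(self, node):
        # A node only ever writes to its own slot.
        if node in self.unreachable:
            raise PartitionUnavailable(node)
        self.slots[node] = self.slots.get(node, 0) + 1

    def read_all(self, node):
        # Any node with storage access can read every slot.
        if node in self.unreachable:
            raise PartitionUnavailable(node)
        return dict(self.slots)


@dataclass
class Node:
    name: str
    in_cluster: bool = True
    last_seen: dict = field(default_factory=dict)

    def tick(self, partition):
        """One heartbeat period: write own slot, then read all slots.

        Returns the other nodes whose counter advanced since this node's
        previous tick. If the partition is unreachable, the node removes
        itself from the cluster and returns an empty set.
        """
        try:
            partition.write(self.name)
            current = partition.read_all(self.name)
        except PartitionUnavailable:
            self.in_cluster = False
            return set()
        alive = {
            other for other, count in current.items()
            if other != self.name and count > self.last_seen.get(other, 0)
        }
        self.last_seen = current
        return alive
```

In this sketch, a node whose slot counter stops advancing simply drops out of the set returned to its peers, while a node that loses access to the partition marks itself as no longer in the cluster.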

In this three-node/six-node example, suppose both the three-node group and the six-node group still had access to the Cluster Services partition. Each group could then see that the other group, although no longer communicating on the LAN, is still active in the cluster. This causes the split-brain algorithm to force a vote, which the group of six nodes wins and the group of three nodes loses. Each node on the losing side then "eats the poison pill," meaning it causes a self-inflicted abend to remove itself from the cluster and stop all processes on the server.
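The outcome of the vote in this example can be illustrated with a short sketch. It assumes only what the example states, namely that the larger group wins. The function name `resolve_split` is hypothetical, and because this section does not describe how equal-sized groups are handled, the sketch raises an error on a tie rather than guessing.

```python
def resolve_split(groups):
    """Return (survivors, poisoned) for groups that can no longer reach
    each other on the LAN but are all still active on the shared partition.

    Assumption: the largest group wins the vote. Tie handling is not
    covered by the text, so a tie is reported as an error.
    """
    if not groups:
        raise ValueError("no groups to resolve")
    sizes = sorted((len(g) for g in groups), reverse=True)
    if len(sizes) > 1 and sizes[0] == sizes[1]:
        raise ValueError("tie between groups: behavior not described here")
    winner = max(groups, key=len)
    # Every node outside the winning group "eats the poison pill".
    poisoned = [node for g in groups if g is not winner for node in g]
    return sorted(winner), sorted(poisoned)


# The three-node/six-node split from the example above.
three = {"node1", "node2", "node3"}
six = {"node4", "node5", "node6", "node7", "node8", "node9"}
survivors, poisoned = resolve_split([three, six])
print("survivors:", survivors)
print("abend:", poisoned)
```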

Why intentionally cause the servers to abend? Why not just have the servers leave the cluster or issue a DOWN command to bring them down? There are several reasons why it is preferable to cause the servers to abend. The two most important reasons are as follows:

So although an abend decreases the availability of a single node in the cluster, it improves the overall availability of the services by quickly restarting them on nodes that are not experiencing problems.

For more information on intentional abends and how to prevent them, see LAN and SAN related problems.
