The following list identifies the most common things to look for when troubleshooting LAN-related problems on a cluster.
The following list identifies the most common things to look for when troubleshooting SAN-related problems on a cluster.
Try replacing a damaged cable with a new one.
Try switching the GBIC to a different node if a spare GBIC is not available.
Try switching to a different port.
Try cleaning the laser light components.
Verify that the SAN devices and drivers are certified for the version of NetWare® you are running.
Your hardware vendor can provide vendor-specific SAN diagnostics based on the model of hardware you are using.
Intentional Abends
The following links provide detailed information on how to avoid intentional abends.
Understanding the Heartbeat Process
The first node that joins the cluster automatically becomes the master node in the cluster. Subsequent nodes that join the cluster become slave nodes. The slave nodes periodically perform a heartbeat based on the heartbeat parameter configured for the cluster by sending a unicast TCP/IP packet to the master node. The master node periodically performs a heartbeat across the LAN based on the master watchdog parameter by sending a multicast packet to all slave nodes.
In addition, each node performs a heartbeat across the SAN by periodically (based on one half of the tolerance parameter) increasing a counter value stored in its own section of the cluster services partition. Each node writes in its own space and reads all other nodes’ sectors prior to writing its own update.
Note: By default, the master and slave heartbeats take place every second, with a tolerance of eight seconds.
A failure detection algorithm is initiated any time a node experiences continuous heartbeat failure for a period equal to the tolerance parameter. The master and slave nodes then communicate over the LAN to form a new cluster based on the nodes chosen by the failure detection algorithm. A new master node is also elected during this phase. The new master node could be the same node as before the failure. Because only one node is allowed to control any given cluster resource at a time, any node that fails to perform the heartbeat is removed from the cluster.
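The timing relationships described above can be illustrated with a minimal Python sketch. It is not Novell code: the class and function names are hypothetical, and the defaults are simply the one-second heartbeats and eight-second tolerance mentioned in the note.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatConfig:
    """Cluster heartbeat settings in seconds (hypothetical container)."""
    heartbeat: float = 1.0        # slave-to-master unicast interval
    master_watchdog: float = 1.0  # master-to-slaves multicast interval
    tolerance: float = 8.0        # continuous silence that triggers detection

    @property
    def san_tick_interval(self) -> float:
        # Each node updates its SAN counter every half tolerance period.
        return self.tolerance / 2


def should_start_failure_detection(last_heard: float, now: float,
                                   config: HeartbeatConfig) -> bool:
    """Return True once heartbeats have been missing for the full tolerance."""
    return (now - last_heard) >= config.tolerance
```

With the defaults, the SAN counter is updated every four seconds, and failure detection starts after eight seconds of continuous silence.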
LAN Failure or Congestion Issues
A frequently asked question is why clusters do not have dedicated TCP/IP networks for heartbeat traffic. Because heartbeat traffic is very light, the heartbeat process itself does not necessitate a dedicated network. Additional reasons you would not want a dedicated network for heartbeat traffic include the following:
Enabling IP Packet Forwarding on the cluster nodes negatively affects client reconnection. This poses a trade-off between managing and dedicating the cluster traffic (unless you have another method of routing to the heartbeat network other than a cluster node).
Because of these and possibly other reasons, many clients decide not to implement a heartbeat network. You need to understand your production network environment and the heartbeat tuning parameters before deciding whether or not to implement a heartbeat network.
Based on the defaults, if eight seconds of consecutive heartbeat packets are dropped, a node is removed from the cluster. If your network is congested or your VLAN includes multiple switches between the cluster nodes, you might have cases where nodes are removed from the cluster simply because the heartbeats didn't get through your network. If your network is too congested to support the default settings, you can increase the heartbeat tolerance. If you increase heartbeat tolerance, be aware that this might adversely affect transparent client reconnections.
As a general rule, if a client can’t reconnect within 60 seconds, it probably will never reconnect without performing a new login. You should calculate how long it takes your resources to migrate and determine the amount of time necessary to add to the tolerance. If you can’t increase the tolerance enough to overcome LAN congestion problems, you must either resolve the LAN congestion problems or implement a dedicated heartbeat network.
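The trade-off between tolerance and client reconnection can be worked through as simple arithmetic. The sketch below assumes a simplified model in which failure detection (the tolerance) plus resource migration must together fit inside the 60-second reconnect window. The function names are hypothetical, and real environments have other delays that this model ignores.

```python
RECONNECT_LIMIT_SECONDS = 60.0  # rule-of-thumb client reconnect window


def max_safe_tolerance(migration_seconds: float,
                       reconnect_limit: float = RECONNECT_LIMIT_SECONDS) -> float:
    """Largest tolerance that still leaves time for resources to migrate."""
    if migration_seconds < 0:
        raise ValueError("migration_seconds must be non-negative")
    return max(0.0, reconnect_limit - migration_seconds)


def tolerance_is_safe(tolerance: float, migration_seconds: float,
                      reconnect_limit: float = RECONNECT_LIMIT_SECONDS) -> bool:
    """True if detection plus migration fits inside the reconnect window."""
    return tolerance + migration_seconds <= reconnect_limit
```

For example, if your resources take 35 seconds to migrate, this model leaves room for a tolerance of at most 25 seconds.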
One of the most effective ways to resolve intentional abends is to ensure that you have the latest NetWare Support Packs and the latest vendor-supplied LAN drivers installed. In most cases, updating LAN drivers will resolve intentional abends.
You should always manually set the LAN card driver to the same speed and duplex setting that is used on your switch. Also, manually set the switch to the proper speed and duplex setting for your network. Avoid using automatic speed or duplex detection on servers or the switch ports that the servers are connected to. This is even more important if you have hardware from different vendors. For example, you might be using an Intel* NIC with a 3COM* switch.
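A short Python sketch of this check, for illustration only, follows. The `LinkSettings` type and the function name are hypothetical. You would fill in the values by hand from your server and switch configuration, because the sketch does not query any hardware.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LinkSettings:
    """Speed and duplex for one end of a link; None means auto-detection."""
    speed_mbps: Optional[int]
    duplex: Optional[str]  # "full" or "half"


def link_setting_problems(server: LinkSettings, switch: LinkSettings) -> List[str]:
    """List the speed and duplex problems described in the text."""
    problems = []
    # Automatic detection on either end is discouraged.
    for name, end in (("server", server), ("switch", switch)):
        if end.speed_mbps is None or end.duplex is None:
            problems.append(f"{name} uses automatic speed or duplex detection")
    # Manually set values must match on both ends.
    if (None not in (server.speed_mbps, switch.speed_mbps)
            and server.speed_mbps != switch.speed_mbps):
        problems.append("speed mismatch")
    if (None not in (server.duplex, switch.duplex)
            and server.duplex != switch.duplex):
        problems.append("duplex mismatch")
    return problems
```

An empty list means both ends are set manually and agree with each other.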
Another LAN card problem involves the improper implementation of the link-indicating counters. In a two-node cluster, Novell® Cluster Services™ watches the following counters to determine if a node is communicating on an Ethernet LAN:
In a two-node cluster, Novell Cluster Services needs to determine the proper node to bring down in the event heartbeats are not getting through; it can’t just assume that the master node is the good node and bring down the slave. It determines which node is good by monitoring the LAN counters and determining which node is actively communicating. If it can’t determine which node is good, it will then bring down the slave node. Unfortunately, not all LAN drivers implement counters. If the LAN driver doesn't implement counters, the master node will always survive, and the slave node will always fail when the heartbeat is not received.
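The tiebreaker decision described above can be sketched as follows. This is an illustration of the stated rules, not the actual Novell Cluster Services implementation. The function name is hypothetical, and `None` is an assumed way of representing a LAN driver that does not implement counters.

```python
from typing import Optional


def node_to_bring_down(master_lan_active: Optional[bool],
                       slave_lan_active: Optional[bool]) -> str:
    """Pick the node to remove from a two-node cluster when heartbeats fail.

    Each argument is True if that node's LAN counters are incrementing,
    False if they are not, and None if the driver provides no counters.
    """
    # Only a clear "slave is communicating, master is not" result
    # brings down the master.
    if slave_lan_active is True and master_lan_active is False:
        return "master"
    # In every other case (master is clearly good, both look alike, or the
    # counters are unavailable) the slave is brought down.
    return "slave"
```

Note that when the counters are unavailable, the result is always "slave", which matches the behavior described for drivers that do not implement counters.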
Note: The feature to detect LAN traffic was added to the Novell Cluster Services 1.01 two-node tiebreaker patch. Without this patch, the master node always wins. This feature also exists in NCS 1.6 and will exist in later versions.
Tuning Server Configuration to Avoid False Split Brain Conditions
NetWare servers by default are tuned to support 200 to 400 client connections. If you have more than 400 connections, you need to modify several parameters to help avoid poison pill conditions.
Some of the parameters you might want to increase are listed in the following table:
| Parameter | Explanation |
| Service Processes | If the server does not have enough service processes for all the processes running, performance might degrade to the point where a heartbeat is not transmitted within the specified tolerance. |
| Packet Receive Buffers | If the server does not have an available packet receive buffer when an incoming packet arrives, the server drops the packet and increases the number of packet receive buffers until the maximum is reached. Once the maximum is reached, the server drops packets until it catches up and frees a buffer. By default, dropping eight seconds of consecutive packets is enough to assume that the monitored server is down. |
| LAN Speed | If the switch is set to 100 Mb and the server is set to 10 Mb (or vice versa), sporadic communications occur which will eventually cause an abend. |
| LAN Duplex | If a server is set to full duplex but the switch is set to half (or vice versa), slow and sporadic communications occur which will eventually cause an abend. |
Categorizing the Four Types of Split Brain Conditions
In general, split brain conditions fall into one of four categories:
For each of these categories, the following sections provide some basic troubleshooting ideas.
Fatal SAN Errors
Any cluster node that cannot read or write to the shared storage is essentially useless to the cluster and must be removed. If the node cannot read or write to the split brain detection partition, it will intentionally remove itself by eating one of the following poison pills:
Each of these abends is caused by a fatal I/O error or device alert, which is signaled by the SAN device driver when invoked by the SBD.NLM module. As with nodes that can't communicate on the LAN, clean shutdowns of misbehaving nodes are problematic, so the node must force itself out of the cluster immediately by eating a poison pill. If you are receiving any of these fatal SAN errors, start by troubleshooting your hardware and the device drivers.
These abends are caused by the device driver passing a fatal error to Novell Cluster Services, which means that you have either a hardware fault or a problem with the driver. Check with hardware manufacturers for tools to help you troubleshoot hardware devices. A fatal SAN error generally signifies an error with the SAN hardware or the SAN driver on the server. For more information on SAN errors, see SAN Related Problems.
In some cases, the Fibre Channel card tries to use High Memory, which causes major instability. Increasing the FILES and/or BUFFERS statements in the CONFIG.SYS file beyond 100 to 150 prevents Fibre Channel cards from doing this and stabilizes the system.
Another common error with SAN implementations is a failure to match the Fibre Channel Host Bus Adapter (HBA) to the SAN topology. Many vendors have generic cards that work for point-to-point, FC-AL, and Fabric SAN implementations. In these cases, there might be a jumper or BIOS setting that can be used to configure the card for the proper SAN topology. In other cases, the hardware vendor requires different cards to match the topology. If this is the case, you can’t use a point-to-point card if your server is attached to a fabric switch.
False Node Failure Detection
False node failure detection is different from a split brain condition in that the cluster thinks the node is dead due to a lack of heartbeat packets when, in reality, the node is alive. In a classical split brain condition, each side of the split brain thinks that the other side has failed and that it needs to start the other side’s resources. In a false split brain condition, one side believes the other side has failed, while the other side thinks everything is fine.
There are two categories of false node failure detection:
Understanding the Sleepy Node Syndrome
There are currently three known situations where a node can appear to have failed, not perform its required abend, and then appear to come back to life. You should be aware of these situations so you can avoid them. Because other nodes take over the resources from the node that appeared to fail, if the node comes back to life and continues on, file system corruption will likely occur because the node believes it is still a member of the previous cluster and has no reason to believe it doesn't still own the resources it had prior to going to sleep. The node continues on as before only until the next time it sends a heartbeat to the shared storage (four seconds, by default), at which time the node realizes that it was given a poison pill and abends.
Since there is potential for data corruption by a node writing to a volume that is already mounted elsewhere, the following three known cases of this occurring should be avoided.
The first case of a sleepy node occurs when a node enters and stays in Real Mode for a period of time equal to or greater than the heartbeat threshold parameter. Because the NetWare floppy disk driver can execute in Real Mode longer than the threshold period, avoid using the floppy drive from a cluster node. If you must use the floppy disk, copy any NLM™ programs from the floppy to the server, and then run them from the server. Loading an NLM directly from a floppy will cause the server to stay in Real Mode too long to respond to the heartbeats from other nodes. It will likely cause a false node failure detection problem.
The second case occurs when an administrator suspends a node by bringing it into the system kernel debugger and then restarts the node after the threshold parameter has passed. If you need to use the kernel debugger, use the CLUSTER DEBUG cluster console command, which halts all nodes, or use the HTML-based NetWare Management Portal and its nonintrusive debugging tools.
If you switch to the system kernel debugger, make sure you either remove the server from the cluster with the CLUSTER LEAVE command or execute the Quit (Q) command to exit to DOS.
Note: Novell Cluster Services SP2 for NetWare 5 and Novell Cluster Services 1.6 for NetWare 6 address kernel debugger problems. These versions do not allow you to type G or T in the debugger unless the SET parameter for the Developer Option is set to On.
The third case that can contribute to sleepy node syndrome is CPU Hog. Software that hogs the CPU can contribute to a poison pill abend.
It is important to understand CPU Hog, because unless you make some modifications to your configuration, you will never know the problem is a CPU Hog, or which module is causing the problem.
An application is considered a CPU Hog if it fails to voluntarily release the CPU as required. If an application takes longer to release the CPU than the heartbeat tolerance, a poison pill occurs because the cluster was not allowed to transmit the required heartbeats. Because the default CPU Hog timeout interval is set to 60 seconds, you would never know that the cause of the abend was a specific application hogging the CPU.
To help determine if one of your applications is hogging the CPU for too much time, adjust the CPU Hog Timeout Amount server parameter to a value less than the heartbeat tolerance parameter. This will cause cluster servers to abend with a CPU hog in the problematic module rather than abending due to a poison pill. Once you identify the problematic module, you can determine the best way to resolve the problem.
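The reasoning behind this adjustment can be shown with a simplified model in Python. The function name is hypothetical. The model assumes that whichever limit is smaller (the CPU Hog timeout or the heartbeat tolerance) is reached first, and it ignores any other timing effects on a real server.

```python
def first_abend(cpu_hold_seconds: float, cpu_hog_timeout: float,
                heartbeat_tolerance: float) -> str:
    """Predict which abend occurs when a module holds the CPU too long.

    Returns "cpu hog" (names the problematic module), "poison pill"
    (hides the culprit), or "none" if neither limit is reached.
    """
    if cpu_hold_seconds < min(cpu_hog_timeout, heartbeat_tolerance):
        return "none"
    # The CPU Hog abend identifies the module only if its timeout
    # expires before the heartbeat tolerance does.
    if cpu_hog_timeout < heartbeat_tolerance:
        return "cpu hog"
    return "poison pill"
```

With the default 60-second CPU Hog timeout and an eight-second tolerance, a module that holds the CPU for 10 seconds produces a poison pill. With the timeout lowered to 6 seconds, the same module produces a CPU Hog abend instead.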
Due to the single-threaded nature of bindery services, you should also eliminate bindery emulation on all of your cluster nodes. Excessive bindery contexts could contribute to a CPU hog problem, but would not point you to a specific module that is causing the problem.
To deal with a CPU hog of another sort, we highly recommend that you eliminate IPX™ from your cluster nodes because all resources must be serviced via TCP/IP to allow automatic reconnection. If you cannot eliminate IPX, then you should at least eliminate the use of the IPXRTR module.
This module has a tendency to periodically hog the CPU for 10 or more seconds. Normally this would not be a problem, but in a cluster it is sufficient to cause false poison pill conditions. In many cases, stability problems are completely resolved by eliminating the use of IPXRTR.
Note: Using INETCFG to disable IPX routing does not remove the IPXRTR module. Consider placing the load and bind commands for IPX in the AUTOEXEC.NCF instead of using INETCFG if you can't eliminate IPX altogether.
Understanding the Divergent View Syndrome
The second category of false node failure detection is the case where the cluster doesn't see the node's heartbeat but the node does see the rest of the cluster. This can be caused when the node’s LAN transmit is not functioning but the receive is. Similar issues could arise if the master node's multicasts can't get through a switch but a slave's unicast packets can. This would result in slaves thinking the master is dead, while the master sees all of the slaves as alive.
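The master and slave example above can be reduced to a small Python sketch. It is a toy model under the assumption that each side judges the other purely by whether its heartbeat packets arrive. The function names are hypothetical.

```python
from typing import Dict


def heartbeat_views(master_multicast_ok: bool,
                    slave_unicast_ok: bool) -> Dict[str, bool]:
    """Model what each side concludes from the heartbeats it receives."""
    return {
        # The master hears slaves only if their unicast packets get through.
        "master_sees_slaves_alive": slave_unicast_ok,
        # Slaves hear the master only if its multicast packets get through.
        "slaves_see_master_alive": master_multicast_ok,
    }


def is_divergent(views: Dict[str, bool]) -> bool:
    """True when the two sides disagree about who is alive."""
    return views["master_sees_slaves_alive"] != views["slaves_see_master_alive"]
```

When the switch blocks the multicast but passes the unicast, the master sees every slave as alive while the slaves conclude that the master is dead, so the views diverge.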
This situation typically results in an abend with a message similar to "Ate poison pill in XXX given by some other node," with XXX varying depending on the specifics of the communications problem. See the table on Split Brain/Poison Pill ABENDS below for more information.
For the condition described above, consider the following potential solutions:
Split Brain Conditions
Split brain conditions generally occur as a result of LAN hardware or software problems. A split brain is a condition where not all of the nodes agree on which servers should be members of the cluster. There is a split in agreement on the view of the cluster membership, with each side of the split thinking that the other side failed.
The following list provides recommendations to help you troubleshoot split brain conditions:
Look for excessive NO ECB Available errors as well as any other types of packet errors.
In addition to the above configuration issues, also check and resolve any LAN connectivity issues. Use the appropriate diagnostic tools for your LAN (sniffer, probes, etc.) to determine if your LAN is causing delays due to congestion, misbehaving hosts, etc.
You can also try increasing the cluster tolerance and slave watchdog parameters, which cause the servers to be more tolerant of delayed packets. However, be aware that increasing the parameters excessively might negatively affect automatic client reconnection.
If you can’t diagnose or resolve the above-mentioned problems, consider implementing a dedicated heartbeat network to isolate the heartbeat traffic. You might have too much traffic on the production LAN for the heartbeat to function efficiently.
NetWare 5.1 and NetWare 6 include a new diagnostic tool that can help you troubleshoot LAN errors at the LSL™ layer. This tool produces an LSL Statistics Monitor screen that can help you determine if you are having issues with dropped packets due to Event Control Block issues. An example of the LSL Statistics Monitor screen is shown below.
Categorizing Various Split Brain/Poison Pill Abends
The following table lists some of the most common split brain/poison pill abends, their categories, and descriptions.
| ABEND | Category | Description |
| CLUSTER: Node castout, fatal SAN read error. | Fatal SAN error | The SAN device driver detects a fatal (nonrecoverable) error while reading from the shared storage. |
| CLUSTER: Node castout, fatal SAN write error. | Fatal SAN error | The SAN device driver detects a fatal (nonrecoverable) error while writing to the shared storage. |
| CLUSTER: Node castout, fatal SAN device alert. | Fatal SAN error | The SAN device driver detects a fatal (nonrecoverable) error while communicating or attempting to communicate with a shared storage device. |
| Ate poison pill in sbdProposeView given by some other node. | False Node Failure Detection | Communications problems cause a divergent view between this node and the cluster. |
| Ate poison pill in sbdWriteNodeTick given by some other node. | False Node Failure Detection | Communications problems cause a divergent view between this node and the cluster. |
| Ate poison pill. Link is down. Other node is alive and ticking. | Split Brain Condition | A two-node cluster condition where the LAN counters are incrementing on the other node but aren't incrementing on this node. Indicates a LAN card or driver failure. |
| This node is in the Minority partition and the node in the Majority partition is alive. | Split Brain Condition | A split brain condition where this server is on the losing side of the vote. |
| At least one of the nodes is alive in the old master node's partition. This node is not in the old master node's partition. | Split Brain Condition | A split brain condition where there is a tie vote. In the case of a tie, the side with the master node wins. This server is on the side that does not contain the master node. |
| The alive partition with the highest node members should survive. This node is not in the alive partition with highest node number. | Split Brain Condition | A split brain condition where the master node is not available because it left the cluster or failed. The side that contains the most nodes wins. This server is not on that side. |
| This cluster node failed to process its self-leave event in a timely fashion and will be forced out of the cluster. | Stalled self-leave | The node tries to leave the cluster, but for some reason it stalls. Because it does not leave cleanly, it is impossible to guarantee that the resources are safe to start on new nodes. The node must bring itself down so that resources safely start elsewhere in the cluster. |
| CRM: CRMSelfLeave: Some resources went in comatose state while SelfLeave. | Stalled self-leave | This situation is similar to the previous one except that a failure is detected while running a resource unload script. It isn't safe to assume this node cleanly stopped the resources that were running on it, so it removes itself from the cluster via an abend to allow the clean start of resources on another node. |
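When reviewing abend logs, the table above can be used as a lookup. The following Python sketch shows one way to do that. The message fragments and categories are taken from the table, but the function name and the case-insensitive substring matching are assumptions, and the actual message text on your servers might differ slightly.

```python
from typing import Optional

# (lowercase message fragment, category) pairs taken from the table above.
ABEND_PATTERNS = [
    ("fatal san read error", "Fatal SAN error"),
    ("fatal san write error", "Fatal SAN error"),
    ("fatal san device alert", "Fatal SAN error"),
    ("ate poison pill in sbdproposeview", "False Node Failure Detection"),
    ("ate poison pill in sbdwritenodetick", "False Node Failure Detection"),
    ("ate poison pill. link is down", "Split Brain Condition"),
    ("this node is in the minority partition", "Split Brain Condition"),
    ("this node is not in the old master node's partition", "Split Brain Condition"),
    ("this node is not in the alive partition", "Split Brain Condition"),
    ("failed to process its self-leave event", "Stalled self-leave"),
    ("crmselfleave", "Stalled self-leave"),
]


def categorize_abend(message: str) -> Optional[str]:
    """Return the table category for an abend message, or None if unknown."""
    text = message.lower()
    for fragment, category in ABEND_PATTERNS:
        if fragment in text:
            return category
    return None
```

A result of `None` means the message is not one of the abends listed in the table.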
A trademark symbol (®, TM, etc.) denotes a Novell trademark. An asterisk (*) denotes a third-party trademark. For information on trademarks, see Legal Notices.