17.5 Diagnosing a Poison Pill Condition

A node is given a poison pill if it stops sending Novell Cluster Services heartbeat packets. This includes cases where the node hangs or reboots.

To evaluate the poison pill condition, review the /var/log/messages log on the node that was given the poison pill and rebooted. Check the messages logged immediately before the restart. Normally, they let you spot the process that caused the server to hang or reboot.
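As a rough aid for that log review, the following Python sketch extracts the lines logged just before the most recent restart. It assumes that the kernel banner ("Linux version ...") appears in the log at each boot and can serve as the restart marker. That marker, the function names, and the default of 50 lines are illustrative choices, not part of Novell Cluster Services. Adjust the pattern if your syslog configuration records restarts differently.

```python
import re
from collections import deque

# Assumption: the kernel banner marks the start of each boot in the log.
BOOT_MARKER = re.compile(r"kernel:.*Linux version ")


def lines_before_last_boot(lines, context=50, marker=BOOT_MARKER):
    """Return up to `context` lines that precede the last boot marker.

    Returns an empty list if no boot marker is found.
    """
    window = deque(maxlen=context)  # rolling buffer of the most recent lines
    before_boot = []
    for line in lines:
        if marker.search(line):
            # Snapshot what was logged right before this restart.
            before_boot = list(window)
        window.append(line)
    return before_boot


def print_pre_reboot_context(path="/var/log/messages", context=50):
    """Print the lines logged just before the most recent restart."""
    with open(path, errors="replace") as log:
        for line in lines_before_last_boot(log, context=context):
            print(line, end="")
```

Calling print_pre_reboot_context() as root on the rebooted node prints the candidate lines to inspect. The sketch only narrows down where to look; it does not identify the offending process.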

You can run the cluster stats display command on the surviving master node to show when the Novell Cluster Services heartbeat packets stopped coming from the rebooted node. To confirm this, you can also run a LAN trace to record the packets exchanged among the nodes.

Other events that might cause a pause in sending out heartbeat packets include the following:

  • Antivirus software scanning the /admin file system

  • Network traffic control (including the firewall)

  • check_oesnss_vol.pl running under Nagios

  • Packages that have been recently updated through the YaST Online Update or patch channels