How to troubleshoot cluster resources that go "comatose" on OES 2 Linux

  • 7001397
  • 19-Sep-2008
  • 08-Nov-2012

Environment

Novell Open Enterprise Server 2 (OES 2)

Resolution

When resources go "Comatose" it means that one or more commands in the load or unload script failed.
You need to identify the line in the load or unload script that is causing the problem.

In /var/run/ncs/ you will find copies of the RESOURCE.load and RESOURCE.unload scripts. These are the local script files that clustering uses to online and offline the resource. They are pulled down from eDirectory when the resource is onlined, so if you need to modify the scripts, do not edit the local copies; edit the load and unload scripts in iManager under Cluster Options instead.

In the /var/run/ncs directory you will also find the RESOURCE.load.out and RESOURCE.unload.out files.  (After the July 16, 2009 update, the *.out files are located in the /var/opt/novell/log/ncs/ directory.)  These files give a detailed record of what happened when the resource was onlined or offlined, including the result of each command in the scripts.  This is where you need to look to see why the resource went comatose.  Each *.out file is overwritten every time the resource is onlined or offlined on that specific server, so you need to know which server attempted the online or offline and went comatose, and then look at the *.out file on that server.
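
As a rough sketch of the lookup described above, the following shell function checks both locations for a resource's *.out files and prints whichever ones exist. The function name find_ncs_out and its optional directory arguments are hypothetical, introduced only for illustration; the two default paths are the ones named in this document.

```shell
#!/bin/bash
# Hypothetical helper (not part of NCS): list the *.out files for one resource.
# Usage: find_ncs_out RESOURCE [DIR ...]
find_ncs_out() {
    local resource="$1"
    shift
    local dirs=("$@")
    # With no directories given, fall back to the two locations named above.
    if [ "${#dirs[@]}" -eq 0 ]; then
        dirs=(/var/opt/novell/log/ncs /var/run/ncs)
    fi
    local dir kind file status=1
    for dir in "${dirs[@]}"; do
        for kind in load unload; do
            file="$dir/$resource.$kind.out"
            if [ -f "$file" ]; then
                echo "$file"
                status=0
            fi
        done
    done
    # Returns 0 if at least one file was found, 1 otherwise.
    return "$status"
}
```

Run it on the server where the resource went comatose, for example "find_ncs_out CP1_SERVER", and then read the listed file with less or cat.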


Example of an NSS Pool load script:
server1:/var/run/ncs # cat CP1_SERVER.load
#!/bin/bash
. /opt/novell/ncs/lib/ncsfuncs
exit_on_error nss /poolact=CP1
exit_on_error ncpcon mount VOL1=254
exit_on_error add_secondary_ipaddress 151.155.242.45
exit_on_error ncpcon bind --ncpservername=CLUSTER_CP1_SERVER --ipaddress=151.155.242.45
exit 0

Common reasons why resources go comatose:
  1. The resource's IP address is already in use by another server on the network. While the resource is offline, a ping of this address should not get a reply.
  2. The volume ID is already in use by another volume.  In the example above, "exit_on_error ncpcon mount VOL1=254" mounts VOL1 with volume ID 254.  Make sure each volume has a unique volume ID and that no volume ID is greater than 254.
  3. The "ncpcon bind" command cannot complete.  NCP is loaded with ndsd (eDirectory).  Run "rcndsd restart" to restart eDirectory, which reloads NCP.
  4. The timeout was too short, so it expired before the script could complete.  Increase the timeout in iManager | Cluster Options | Cluster Resource | Load and Unload Scripts.  The default is 1 minute; increase it to 10 minutes.
  5. Make sure you are not using the following variable names in cluster scripts.  They are hardcoded in the /opt/novell/ncs/lib/ncsfuncs file, which is sourced by the second line of the cluster scripts (". /opt/novell/ncs/lib/ncsfuncs").
    1. IP_ADDR
    2. FILE_SYSTEM
    3. OCF_DIR
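
Reasons 2 and 5 can be checked mechanically against the local script copies. The sketch below is one possible way to do that, not a Novell-supplied tool: the function names check_reserved_vars and duplicate_volume_ids are hypothetical, and the checks only see the scripts you pass in, so a volume mounted outside these scripts or on another node will not be detected.

```shell
#!/bin/bash
# Hypothetical helpers (not part of NCS) for reasons 2 and 5 above.

# Print (with line numbers) every line in a script that assigns to one of
# the variable names reserved by ncsfuncs. Returns non-zero if none is found.
check_reserved_vars() {
    grep -nE '^[[:space:]]*(export[[:space:]]+)?(IP_ADDR|FILE_SYSTEM|OCF_DIR)=' "$1"
}

# Print each volume ID that appears in more than one "ncpcon mount VOL=ID"
# command across the given load scripts.
duplicate_volume_ids() {
    grep -hoE 'ncpcon[[:space:]]+mount[[:space:]]+[^[:space:]=]+=[0-9]+' "$@" |
        sed -E 's/.*=([0-9]+)$/\1/' |
        sort -n |
        uniq -d
}
```

For example, "check_reserved_vars /var/run/ncs/CP1_SERVER.load" lists any offending assignments in that script, and "duplicate_volume_ids /var/run/ncs/*.load" prints volume IDs that are used more than once in the load scripts present on that node.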