Cluster node will not join the cluster

  • 7001434
  • 24-Sep-2008
  • 08-Nov-2012

Environment

Novell Open Enterprise Server 2 (OES 2)
Novell Open Enterprise Server (Linux based)

Situation

This TID provides troubleshooting steps for a cluster node that will not join the cluster.

Resolution

  1. Reboot all nodes.
  2. Turn off the firewall on all nodes, because it could be preventing them from receiving the LAN broadcasts from the Master Node.  Stop the firewall with "rcSuSEfirewall2 stop".
    For help on setting up the firewall with Novell Cluster Services see
    TID 7002738 - Using SuSEfirewall2 with Novell Cluster Services (NCS)
  3. Check the IP address and subnet mask of each node and make sure all nodes are on the same subnet.
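    The subnet comparison in step 3 can be sketched as a small shell helper. This is only an illustration: the function names ip_to_int and same_subnet, and the addresses in the usage line, are hypothetical and are not part of Novell Cluster Services.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Exit 0 when both addresses fall in the same subnet for the given mask,
# exit 1 otherwise.  Arguments: <ip_a> <ip_b> <netmask>
same_subnet() (
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
)
```

    Example use, with made-up addresses: same_subnet 10.10.1.11 10.10.1.12 255.255.255.0 && echo "same subnet"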
  4. Confirm that the node is seeing the SBD partition.
    • Use “sbdutil -f”.  If it cannot find the partition, try “sbdutil -f -s”.
      server1:~ # sbdutil -f
           /dev/evms/.nodes/cluster.sbd
      server1:~ # sbdutil -f -s
           /dev/evms/.nodes/cluster.sbd

    • If this fails, verify that the SBD partition is visible with the various disk utilities (NSSMU, fdisk, evmsgui, etc.).
    • Check the sysfs_devices section of the /etc/evms.conf file to make sure the disk that holds the SBD partition is allowed.  If you are using multipathing, this is the multipath device.
  5. Confirm that you are communicating with LDAP by using “/opt/novell/ncs/bin/ncs-configd.py -init”.  This uses the first LDAP server listed in the /etc/opt/novell/ncs/clstrlib.conf file.  If it is communicating with LDAP you will see several lines similar to the following:
          server1:~ # /opt/novell/ncs/bin/ncs-configd.py -init
               dos2unix: converting file /var/opt/novell/ncs/CP1_SERVER.load to UNIX format ...
               dos2unix: converting file /var/opt/novell/ncs/CP1_SERVER.unload to UNIX format ...
               dos2unix: converting file /var/opt/novell/ncs/CP1_SERVER.monitor to UNIX format ...
               dos2unix: converting file /var/opt/novell/ncs/Master_IP_Address_Resource.load to UNIX format ...
               dos2unix: converting file /var/opt/novell/ncs/Master_IP_Address_Resource.unload to UNIX format ...
               dos2unix: converting file /var/opt/novell/ncs/iPrint_Template.load to UNIX format ...
    If this fails, modify the LDAP server IP address in clstrlib.conf or recreate the clstrlib.conf file following TID 3147787 "How to recreate clstrlib.conf file for Novell Cluster Services".
  6. Confirm that your node is listed in the nodes.xml file on each cluster server.  If it is not listed there, you will have to re-add this node to the cluster.
         server1:/var/opt/novell/ncs # cat nodes.xml  | grep server1
              <dsml:entry dn="cn=server1,cn=cluster,o=novell">
                    server1
  7. Panning ID problem:
    • In /var/log/messages you see "CLUSTER-<INFO>-<2090>: Join retry, some other node acquired the cluster lock"
      This means that one of the other nodes has a lock on the SBD partition but this node is not communicating with that node over the LAN.
    • INFO: The Panning ID allows multiple clusters to share the same LAN but remain isolated from each other, ignoring each other's packets.  The Panning ID is constant for all nodes in the same cluster, and each cluster has a unique Panning ID.
    • Check the Panning ID to confirm if this is the problem.
      • In iManager go to "Clusters" | "Cluster Options" and select your cluster and then select the "Properties..." button.
        Write down the "panning clusterid" value, for example "panning clusterid 3294344161".
      • Confirm that /var/opt/novell/ncs/gipc.conf has the same "panning clusterid" value.  This file is recreated when cluster information is pulled down from eDirectory, so do not edit it directly if the value is incorrect.
      • Take a LAN trace for a few seconds, enough to capture a heartbeat packet from the cluster. By default, cluster heartbeat packets go out every second. Find a heartbeat packet in your trace that goes with this cluster. The panning ID (in Hex) will be 4 bytes, beginning at offset 26 (hex) of the packet.  See KB 3075104 for more details on finding the panning ID if your LAN analysis tool does not interpret it for you.
        In the LAN trace you will find heartbeat packets from the server that is trying to join the cluster and from the server that is holding the Master IP address resource.  Look at both heartbeat packets and determine if they have the same panning ID.
    • To determine what the correct Panning ID should be, do the following:
      • In iManager go to "Directory Administration" | "Modify Object", select your cluster object, then OK, highlight "GUID" and select "Edit...".  Write down the first 8 characters.  For example, if the GUID is e1b35bc4a4c25e4ed39ae1b35bc4a4c2, the first eight are e1b35bc4.
      • Double check this with /var/log/messages by searching on "nCSGuid".  As the node tries to join the cluster you will see nCSGuid = e1b35bc4-a4c2-5e4e-d39a-e1b35bc4a4c2.
      • Now reverse these eight characters, two characters at a time, so e1b35bc4 becomes c45bb3e1.
      • Convert from hex to decimal and you should have the Panning ID: c45bb3e1 hex = 3294344161 decimal.
      • In the LAN trace the Panning ID will show one higher, c45bb3e2.  This is normal.
    • To correct an incorrect Panning ID
      • Power off all of the nodes in the cluster at the same time, then reboot and try to join the cluster.
      • If that does not work, unload clustering on all nodes with "rcnovell-ncs stop".  Then in iManager go to "Directory Administration" | "Modify Object", select your cluster object, then OK, highlight "NCS:GIPC Config", select "Edit..." and modify the "panning clusterid 3294344161" value.  If "NCS:GIPC Config" is not a valued attribute, then you do not need it (it is only there if this cluster was ever on NetWare), so there is nothing to do in this step.
      • Run "/opt/novell/ncs/bin/ncs-configd.py -init" on all nodes.  This pulls down the new Panning ID into /var/opt/novell/ncs/gipc.conf.
      • Restart clustering with "rcnovell-ncs start"
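    The Panning ID arithmetic in step 7 can be checked with a short shell sketch. The function names are hypothetical, and the gipc.conf parser assumes the value sits on a line of the form "panning clusterid <number>", as quoted in the steps above; adjust it if your file looks different.

```shell
# Derive the decimal Panning ID from a cluster GUID (dashes are optional).
panning_id_from_guid() (
    # Strip dashes and keep the first eight hex characters, e.g. e1b35bc4.
    first8=$(printf '%s' "$1" | tr -d '-' | cut -c1-8)
    # Reverse them two characters at a time: e1 b3 5b c4 -> c4 5b b3 e1.
    reversed=$(printf '%s' "$first8" | sed 's/\(..\)\(..\)\(..\)\(..\)/\4\3\2\1/')
    # Convert from hex to decimal.
    printf '%d\n' "0x$reversed"
)

# Hex value expected in a LAN trace: one higher than the Panning ID.
lan_trace_panning_id() {
    printf '%08x\n' $(( $1 + 1 ))
}

# Read the Panning ID from a gipc.conf-style file (assumed line format:
# "panning clusterid <number>").  Argument: path to the file.
panning_id_from_gipc_conf() {
    awk '$1 == "panning" && $2 == "clusterid" { print $3; exit }' "$1"
}
```

    With the GUID from the example above, panning_id_from_guid e1b35bc4a4c25e4ed39ae1b35bc4a4c2 prints 3294344161, and lan_trace_panning_id 3294344161 prints c45bb3e2.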
  8. When clustering loads (novell-ncs start) it will run the /opt/novell/ncs/bin/ldncs script.  Modify this script to enable debugging to the /var/log/messages file.  Remove the # (comment) symbol before the echo commands.
         ...
           echo -n "TRACE ON"> /proc/ncs/vll
           echo -n "TRACE SBD ON"> /proc/ncs/vll
           echo -n "TRACE GIPC ON"> /proc/ncs/vll
           echo -n "TRACE MCAST ON"> /proc/ncs/vll
           echo -n "TRACE CVB ON"> /proc/ncs/cluster
           echo -n "TRACE CSS ON"> /proc/ncs/cluster
           echo -n "TRACE CRM ON"> /proc/ncs/cluster
           echo -n "TRACE CMA ON"> /proc/ncs/cluster

          # export NCSCONFIGD=1
          # export NCSRESOURCED=1

           echo -n "debug"> /admin/adminfs.cmd
          ...
    To turn off cluster debugging, put the # (comment) symbols back in the /opt/novell/ncs/bin/ldncs file.  You will also need to turn off adminfs debugging manually with the following command:
           echo -n "-debug"> /admin/adminfs.cmd
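
With tracing enabled, /var/log/messages grows quickly, so it may help to pull out only the two strings mentioned in step 7. The following is a rough sketch: the function names are hypothetical, and it assumes the log lines contain the text exactly as quoted in this TID ("nCSGuid = ..." and "Join retry, some other node acquired the cluster lock").

```shell
# Print the most recent nCSGuid value found in a log file.
# Argument: path to the log file (for example /var/log/messages).
ncs_guid_from_log() {
    grep -o 'nCSGuid = [0-9a-fA-F-]*' "$1" | tail -n 1 | sed 's/^nCSGuid = //'
}

# Count how many times the join-retry message appears in a log file.
join_retry_count() {
    grep -c 'Join retry, some other node acquired the cluster lock' "$1"
}
```

A non-zero join_retry_count together with a GUID that yields an unexpected Panning ID suggests that the Panning ID check in step 7 is the one to follow up.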