Error: "Join retry, some other node acquired the cluster lock"
Novell Cool Solutions: Tip
By Reuben Bryant
Updated: 16 Mar 2006
16 Mar 2006 - Updated with reader comments here
After changing the default gateway and subnet mask on NetWare cluster nodes and restarting them, one of the nodes joins the cluster and becomes the master node. The other one tries to join, but fails with the error: "Join retry, some other node acquired the cluster lock".
I also noticed that on the node that was master, the counter under "TCPCON>Statistics>IP>Outbound discarded datagrams>No route found" was climbing at a rate of one packet per second, and a rogue routing entry was showing in the routing table and returned after we removed it. We were able to ping the other node, and DNS and DS were working. We were also able to ping and trace route to all the other servers in that VLAN.
The network switches had been previously restarted in an unsuccessful attempt to resolve this.
Enabling OSPF on both cluster nodes, in line with the existing OSPF network structure, fixed the problem.
To enable OSPF:
INETCFG>PROTOCOLS>TCP/IP>Enable (and set area config)
INETCFG>BINDINGS>TCP/IP>Select NIC>Bind Options>OSPF Bind Options>Enable (and select area)
Environment:
- NetWare 6.5 SP3 CPR Release
- Novell Cluster Services 1.8
- Moving from a flat LAN to a VLAN setup.
- IP addresses were not changed on any server or cluster node.
To understand this error, you need to recall that Novell Cluster Services (NCS) uses two methods of testing the connectivity and presence of nodes. The first is through the shared disk channel, via the SBD partition. The second is through the use of a network heartbeat.
This allows NCS to handle the following cases:
- A node is up and talking to the disk, but off the network (network problem).
- A node is up and on the network, but not talking to the disk (disk channel problem).
- A node is completely down: off the network and not talking to the disk.
- A node is up and talking to both the network and the disk.
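The four cases above can be sketched as a simple decision over the two independent checks. This is only an illustration of the reasoning, not NCS code: the names `NodeState` and `classify_node` and the two boolean inputs are hypothetical and do not correspond to any Novell API.

```python
from enum import Enum


class NodeState(Enum):
    # Hypothetical labels for the four cases listed above.
    HEALTHY = "up, talking to both network and disk"
    NETWORK_PROBLEM = "up, talking to the disk, but off the network"
    DISK_CHANNEL_PROBLEM = "up, on the network, but not talking to the disk"
    DOWN = "off the network and not talking to the disk"


def classify_node(network_heartbeat_seen: bool, sbd_update_seen: bool) -> NodeState:
    """Map the two presence checks (assumed inputs) onto the four cases."""
    if network_heartbeat_seen and sbd_update_seen:
        return NodeState.HEALTHY
    if sbd_update_seen:
        # Visible on the shared disk only: the situation behind this tip.
        return NodeState.NETWORK_PROBLEM
    if network_heartbeat_seen:
        return NodeState.DISK_CHANNEL_PROBLEM
    return NodeState.DOWN
```

In the situation described in this tip, the joining node would fall into the `NETWORK_PROBLEM` case: visible through the SBD partition, but not through the network heartbeat.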
The error appears when a node is coming up and trying to join the cluster. It can see the shared SBD (Split Brain Detection) partition and is trying to tell the cluster it is joining, but it cannot see the other cluster nodes on the network.
Thus after a network change (moving VLANs, changing topology, routing paths) you might run into this until all nodes are back in communication.
A further clue is the "No route found" counter incrementing once per second, which is of course the default heartbeat time. The heartbeat packets were probably not making it across the network until the routing change was made.
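To check whether a climbing counter really tracks the heartbeat, a couple of readings taken by hand from TCPCON can be compared against the expected rate. The sketch below assumes the one-second heartbeat mentioned above; the function names, the sample format (seconds, counter value) and the 20% tolerance are hypothetical choices made for illustration only.

```python
from typing import Sequence, Tuple


def counter_rate(samples: Sequence[Tuple[float, int]]) -> float:
    """Average increase per second between the first and last reading."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (c1 - c0) / (t1 - t0)


def matches_heartbeat(
    samples: Sequence[Tuple[float, int]],
    heartbeat_interval_s: float = 1.0,  # assumed default, per the text above
    tolerance: float = 0.2,             # hypothetical 20% margin
) -> bool:
    """True if the counter grows at roughly one count per heartbeat."""
    expected = 1.0 / heartbeat_interval_s
    return abs(counter_rate(samples) - expected) <= tolerance * expected


# Example: two manual readings of "No route found", taken 30 seconds apart.
readings = [(0.0, 4120), (30.0, 4150)]
looks_like_heartbeat = matches_heartbeat(readings)
```

A match does not prove the discarded datagrams are heartbeats, but it is a strong hint that routing for the heartbeat traffic is what needs attention.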