7.6 Resolving Business Continuity Cluster Failures

A business continuity cluster can experience several types of failure. Understanding the failure types and knowing how to respond to each can help you recover a cluster more quickly. Some failure types and responses differ depending on whether you have implemented SAN-based mirroring or host-based mirroring. Responding to certain types of failures sometimes requires promoting or demoting LUNs.

NOTE: The terms promote and demote are used here in describing the process of changing LUNs to a state of primary or secondary, but your SAN vendor documentation might use different terms such as mask and unmask.

7.6.1 SAN-Based Mirroring Failure Types and Responses

SAN-based mirroring failure types and responses are described in the following sections:

Primary Cluster Fails but Primary SAN Does Not

This type of failure can be temporary (transient) or long-term. There should be an initial response and then a long-term response based on whether the failure is transient or long-term. The initial response is to restore the cluster to normal operations. The long-term response is total recovery from the failure.

Promote the secondary LUN to primary. Cluster resources then load and become primary on the secondary cluster. If the former primary SAN has not been demoted to secondary, you might need to demote it manually. The former primary SAN must be demoted to secondary before you bring the cluster servers back up. Consult your SAN hardware documentation for instructions on demoting and promoting SANs. You can use the cluster resetresources console command to change resource states to offline and secondary.

Prior to bringing up the cluster servers, you must ensure that the SAN is in a state in which the cluster resources cannot come online and cause a divergence in data. Divergence in data occurs when connectivity between SANs has been lost and both clusters assert that they have ownership of their respective disks.
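The precondition above can be expressed as a simple check. The following Python fragment is only a sketch: the SiteState type, its fields, and the function name are hypothetical and do not correspond to any SAN vendor or cluster API. It assumes that you can determine each site's LUN role from your SAN tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteState:
    # Hypothetical model of one site; not a real product API.
    name: str
    lun_role: str  # "primary" or "secondary"


def safe_to_start_servers(recovering: SiteState, active: SiteState) -> bool:
    """Return False when both sites hold primary LUNs.

    Starting cluster servers in that state could let both clusters bring
    the same resources online, which is the data divergence described above.
    """
    both_primary = (
        recovering.lun_role == "primary" and active.lun_role == "primary"
    )
    return not both_primary
```

For example, a former primary site whose LUNs were never demoted fails this check against the site that took over, which indicates that the former primary SAN must be demoted before its servers are started.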

Primary Cluster and Primary SAN Both Fail

Bring the primary SAN back up and follow your SAN vendor’s instructions to remirror and, if necessary, promote the former primary SAN back to primary. Then bring up the former primary cluster servers and fail back the cluster resources.

Secondary Cluster Fails but Secondary SAN Does Not

No additional response is necessary for this failure other than recovering the secondary cluster. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.

Secondary Cluster and Secondary SAN Both Fail

Bring the secondary SAN back up and follow your SAN vendor's instructions to remirror. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.

Primary SAN Fails but Primary Cluster Does Not

When the primary SAN fails, the primary cluster also fails. Bring the primary SAN back up and follow your SAN vendor’s instructions to remirror and, if necessary, promote the former primary SAN back to primary. You might need to demote the LUNs and resources to secondary on the primary SAN before bringing them back up. You can use the cluster resetresources console command to change resource states to offline and secondary. Bring up the former primary cluster servers and fail back resources.

Secondary SAN Fails but Secondary Cluster Does Not

When the secondary SAN fails, the secondary cluster also fails. Bring the secondary SAN back up and follow your SAN vendor’s instructions to remirror. Then bring the secondary cluster back up. When you bring the secondary SAN and cluster back up, resources are still in a secondary state.

Intersite SAN Connectivity Is Lost

Recover your SANs first, then remirror from the good side to the bad side.

Intersite LAN Connectivity Is Lost

Users might not be able to access servers in the primary cluster but might be able to access servers in the secondary cluster. If both clusters are up, no additional response is required. An error is displayed; wait for connectivity to resume.

If you have configured the automatic failover feature, see Section C.0, Setting Up Auto-Failover.
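The SAN-based responses in this section can be summarized as a lookup table for use in a runbook or checklist tool. The following Python sketch paraphrases the steps from the subsections above. The dictionary keys and the function name are hypothetical labels chosen for illustration, and the strings are reminders for an operator, not commands that the code executes.

```python
from typing import Dict, List

# Ordered response steps per SAN-based failure type, paraphrased from
# Section 7.6.1. The keys are hypothetical labels, not product identifiers.
SAN_BASED_RESPONSES: Dict[str, List[str]] = {
    "primary_cluster_only": [
        "Promote the secondary LUN to primary",
        "Let cluster resources load on the secondary cluster",
        "Demote the former primary SAN to secondary (manually if needed)",
        "Bring the former primary cluster servers back up",
    ],
    "primary_cluster_and_san": [
        "Bring the primary SAN back up",
        "Remirror per SAN vendor instructions",
        "If necessary, promote the former primary SAN back to primary",
        "Bring up the former primary cluster servers",
        "Fail back the cluster resources",
    ],
    "secondary_cluster_only": [
        "Recover the secondary cluster",
    ],
    "secondary_cluster_and_san": [
        "Bring the secondary SAN back up",
        "Remirror per SAN vendor instructions",
        "Bring the secondary cluster back up",
    ],
    "primary_san_only": [
        "Bring the primary SAN back up",
        "Remirror per SAN vendor instructions",
        "If necessary, promote the former primary SAN back to primary",
        "If needed, demote LUNs and resources to secondary on the primary SAN",
        "Bring up the former primary cluster servers",
        "Fail back resources",
    ],
    "secondary_san_only": [
        "Bring the secondary SAN back up",
        "Remirror per SAN vendor instructions",
        "Bring the secondary cluster back up",
    ],
    "intersite_san_link": [
        "Recover the SANs",
        "Remirror from the good side to the bad side",
    ],
    "intersite_lan_link": [
        "Wait for connectivity to resume",
    ],
}


def response_steps(failure_type: str) -> List[str]:
    """Return a copy of the ordered steps for one failure type."""
    try:
        return list(SAN_BASED_RESPONSES[failure_type])
    except KeyError:
        raise ValueError(f"unknown failure type: {failure_type!r}") from None
```

Keeping the steps as ordered lists preserves the sequence that matters in this section, such as remirroring before cluster servers are brought back up.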

7.6.2 Host-Based Mirroring Failure Types and Responses

Primary Cluster Fails but Primary SAN Does Not

The response to this failure is the same as for SAN-based mirroring, as described in "Primary Cluster Fails but Primary SAN Does Not" in Section 7.6.1, SAN-Based Mirroring Failure Types and Responses. Do not disable MSAP (Multiple Server Activation Prevention), which is enabled by default.

Primary Cluster and Primary SAN Both Fail

Bring up your primary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command from a secondary cluster server. Ensure that remirroring completes before bringing downed cluster servers back up.

If necessary, promote the former primary SAN back to primary. Then bring up the former primary cluster servers and fail back the cluster resources.
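The order of operations matters in this response: storage first, then the device scan, then a completed remirror, and only then the cluster servers. The following Python sketch checks a planned sequence against that order. The step labels and the function name are hypothetical, and the optional promotion step is omitted for brevity.

```python
from typing import List, Optional

# Required order for the host-based recovery described above.
# The labels are hypothetical names used only for this illustration.
REQUIRED_ORDER: List[str] = [
    "bring_up_storage",      # primary SAN or iSCSI target
    "scan_for_new_devices",  # run from a secondary cluster server
    "wait_for_remirror",     # remirroring must complete
    "bring_up_servers",      # former primary cluster servers
    "fail_back_resources",
]


def validate_plan(plan: List[str]) -> Optional[str]:
    """Return None if the plan follows REQUIRED_ORDER from the start,
    otherwise a message that describes the first problem found."""
    if len(plan) > len(REQUIRED_ORDER):
        return "plan has more steps than the required order"
    for position, (got, want) in enumerate(zip(plan, REQUIRED_ORDER), start=1):
        if got != want:
            return f"step {position}: expected {want!r}, got {got!r}"
    return None
```

A plan that starts the cluster servers before the storage is available is rejected at step 1, which mirrors the requirement that remirroring completes before downed cluster servers are brought back up.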

Secondary Cluster Fails but Secondary SAN Does Not

No additional response is necessary for this failure other than recovering the secondary cluster. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.

Secondary Cluster and Secondary SAN Both Fail

Bring up your secondary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command on a primary cluster server to ensure that remirroring takes place. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.

Primary SAN Fails but Primary Cluster Does Not

If your primary SAN fails, all nodes in your primary cluster also fail. Bring up your primary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command from a secondary cluster server. Ensure that remirroring completes before bringing downed cluster servers back up.

If necessary, promote the former primary SAN back to primary. You might need to demote the LUNs and resources to secondary on the primary SAN before bringing them back up. You can use the cluster resetresources console command to change resource states to offline and secondary. Bring up the former primary cluster servers and fail back resources.

Secondary SAN Fails but Secondary Cluster Does Not

Bring up your secondary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command on a primary cluster server to ensure remirroring takes place. Then bring the secondary cluster back up. When you bring the secondary SAN and cluster back up, resources are still in a secondary state.

Intersite SAN Connectivity Is Lost

Recover your SANs first. Then run the Cluster Scan For New Devices command on both clusters to ensure that remirroring takes place, and remirror from the good side to the bad side.

Intersite LAN Connectivity Is Lost

Users might not be able to access servers in the primary cluster but might be able to access servers in the secondary cluster. If both clusters are up, no additional response is required. An error is displayed; wait for connectivity to resume.

If you have configured the automatic failover feature, see Section C.0, Setting Up Auto-Failover.