You should be aware of several failure types associated with a business continuity cluster. Understanding the failure types and knowing how to respond to each can help you recover a cluster more quickly. Some of the failure types and responses differ depending on whether you have implemented SAN-based mirroring or host-based mirroring. Promoting or demoting LUNs is sometimes necessary when responding to certain types of failures.
NOTE: The terms promote and demote are used here to describe the process of changing LUNs to a primary or secondary state, but your SAN vendor documentation might use different terms, such as mask and unmask.
SAN-based mirroring failure types and responses are described in the following sections:
This type of failure can be temporary (transient) or long-term. There should be an initial response and then a long-term response based on whether the failure is transient or long term. The initial response is to restore the cluster to normal operations. The long-term response is total recovery from the failure.
Promote the secondary LUN to primary. Cluster resources then load and become primary on the secondary cluster. If the former primary SAN has not been demoted to secondary, you might need to demote it manually. The former primary SAN must be demoted to secondary before you bring the cluster servers back up. Consult your SAN hardware documentation for instructions on demoting and promoting SANs. You can use the cluster resetresources console command to change resource states to offline and secondary.
Prior to bringing up the cluster servers, you must ensure that the SAN is in a state in which the cluster resources cannot come online and cause a divergence in data. Divergence in data occurs when connectivity between SANs has been lost and both clusters assert that they have ownership of their respective disks.
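The divergence condition described above can be expressed as a simple check. The following Python sketch is illustrative only and is not part of the product: the SiteStatus class, its fields, and the divergence_risk function are hypothetical names, and the values are assumed to be entered by an operator after inspecting the SAN, not read from any real API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SiteStatus:
    """Hypothetical snapshot of one site, filled in by the operator."""
    name: str
    lun_role: str  # "primary" or "secondary", as reported by the SAN


def divergence_risk(inter_san_link_up: bool, sites: Iterable[SiteStatus]) -> bool:
    """Return True when the SAN link is down and more than one site holds primary LUNs."""
    primaries = [s.name for s in sites if s.lun_role == "primary"]
    return (not inter_san_link_up) and len(primaries) > 1


if __name__ == "__main__":
    # Both sites claim primary while the inter-SAN link is down.
    sites = [SiteStatus("site-a", "primary"), SiteStatus("site-b", "primary")]
    if divergence_risk(False, sites):
        print("Demote one side before starting cluster servers")
```

If the check reports a risk, demote the former primary side before any cluster server at that site is started.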
Bring the primary SAN back up and follow your SAN vendor’s instructions to remirror and, if necessary, promote the former primary SAN back to primary. Then bring up the former primary cluster servers and fail back the cluster resources.
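As a planning aid, the long-term response above can be written down as an ordered checklist. This Python sketch only mirrors the order of the steps in the text. The step names and the next_step helper are hypothetical, and the actual remirror and promote operations remain whatever your SAN vendor documents.

```python
from typing import FrozenSet, Optional, Set

# (step name, optional?) in the order the text lists them.
# Step names are illustrative labels, not product commands.
RECOVERY_STEPS = [
    ("bring_up_primary_san", False),
    ("remirror_per_vendor_instructions", False),
    ("promote_former_primary_san", True),  # only "if necessary"
    ("bring_up_former_primary_servers", False),
    ("fail_back_cluster_resources", False),
]


def next_step(done: Set[str], skipped: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the first step that is neither done nor (if optional) skipped."""
    for name, optional in RECOVERY_STEPS:
        if name in done:
            continue
        if optional and name in skipped:
            continue
        return name
    return None  # every required step is complete


if __name__ == "__main__":
    done = {"bring_up_primary_san", "remirror_per_vendor_instructions"}
    print(next_step(done, frozenset({"promote_former_primary_san"})))
```

Only the step marked optional can be skipped; listing a required step as skipped has no effect, so the helper never lets the servers come up ahead of the SAN and the remirror.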
No additional response is necessary for this failure other than recovering the secondary cluster. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.
Bring the secondary SAN back up and follow your SAN vendor's instructions to remirror. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.
When the primary SAN fails, the primary cluster also fails. Bring the primary SAN back up and follow your SAN vendor’s instructions to remirror and, if necessary, promote the former primary SAN back to primary. You might need to demote the LUNs and resources to secondary on the primary SAN before bringing them back up. You can use the cluster resetresources console command to change resource states to offline and secondary. Bring up the former primary cluster servers and fail back the resources.
When the secondary SAN fails, the secondary cluster also fails. Bring the secondary SAN back up and follow your SAN vendor’s instructions to remirror. Then bring the secondary cluster back up. When you bring the secondary SAN and cluster back up, resources are still in a secondary state.
Recover your SANs first, then remirror from the good side to the bad side.
Users might not be able to access servers in the primary cluster but can possibly access servers in the secondary cluster. If both clusters are up, an error is displayed but no additional response is required. Wait for connectivity to resume.
If you have configured the auto-failover feature, see Section B.0, Setting Up Auto-Failover.
Host-based mirroring failure types and responses are described in the following sections:
Response for this failure is the same as for SAN-based mirroring described in Primary Cluster Fails but Primary SAN Does Not in Section 2.6.1, SAN-Based Mirroring Failure Types and Responses. Do not disable MSAP (Multiple Server Activation Prevention), which is enabled by default.
Bring up your primary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command from a secondary cluster server. Ensure that remirroring completes before bringing downed cluster servers back up.
If necessary, promote the former primary SAN back to primary. Then bring up the former primary cluster servers and fail back the cluster resources.
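The ordering constraint in this response (storage first, then the device scan, then a completed remirror, and only then the downed servers) can be captured as a pre-start check. The sketch below is a hypothetical helper, not a product feature. Its three inputs are assumed to be confirmed manually by the operator.

```python
from typing import List


def start_blockers(storage_up: bool, scan_issued: bool, remirror_complete: bool) -> List[str]:
    """List the reasons the downed cluster servers should not be started yet."""
    reasons = []
    if not storage_up:
        reasons.append("primary SAN or iSCSI target is not up")
    if not scan_issued:
        reasons.append("Cluster Scan For New Devices has not been run on a secondary cluster server")
    if not remirror_complete:
        reasons.append("remirroring has not completed")
    return reasons


if __name__ == "__main__":
    # Storage is back and the scan was run, but the mirror is still syncing.
    for reason in start_blockers(True, True, False):
        print("Do not start servers yet:", reason)
```

An empty list means that none of the three preconditions named in the text is outstanding.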
No additional response is necessary for this failure other than recovering the secondary cluster. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.
Bring up your secondary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command on a primary cluster server to ensure remirroring takes place. When you bring the secondary cluster back up, the LUNs are still in a secondary state to the primary SAN.
If your primary SAN fails, all nodes in your primary cluster also fail. Bring up your primary SAN or iSCSI target and then bring up your cluster servers. Then run the Cluster Scan For New Devices command from a secondary cluster server. Ensure that remirroring completes before bringing downed cluster servers back up.
If necessary, promote the former primary SAN back to primary. You might need to demote the LUNs and resources to secondary on the primary SAN before bringing them back up. You can use the cluster resetresources console command to change resource states to offline and secondary. Bring up the former primary cluster servers and fail back the resources.
Bring up your secondary SAN or iSCSI target before bringing up your cluster servers. Then run the Cluster Scan For New Devices command on a primary cluster server to ensure that remirroring takes place. Then bring the secondary cluster back up. When you bring the secondary SAN and cluster back up, resources are still in a secondary state.
You must run the Cluster Scan For New Devices command on both clusters to ensure remirroring takes place. Recover your SANs first, then remirror from the good side to the bad side.
Users might not be able to access servers in the primary cluster but can possibly access servers in the secondary cluster. If both clusters are up, an error is displayed but no additional response is required. Wait for connectivity to resume.
If you have configured the auto-failover feature, see Section B.0, Setting Up Auto-Failover.