eDirectory Disaster Recovery
Novell Cool Solutions: Feature
By Jim Henderson
Posted: 24 Aug 2005
The following procedure outlines a simple process to rebuild a server that has crashed. This procedure is based on a BrainShare session from 2001, but has been updated to reflect steps that should be taken regardless of the platform on which eDirectory is running.
Other options are available, such as the Hot Continuous Backup feature of eDirectory 8.7.3. This scenario assumes that no backup of the crashed server's database exists, but that other servers in the tree hold replicas of the partitions that were on the crashed server.
Here is a quick checklist of things to remember to do - the following sections will go into detail about each of these steps.
- Don't Panic!
- Reconfigure Time Synchronization
- Create a Temporary Placeholder Object
- Use SrvRef to Replace Server References
- Delete the Old Server Object
- Restore Server Specific Information
- Verify Replica Rings are Clean
- Rebuild and Re-Patch the Crashed Server
- Fix File System Trustee Assignments for SYS Volume
- Remove eDirectory if Installed into Temporary Tree
- Reconfigure Time Synchronization
- Install Rebuilt Server into Production Tree
- Restore Server References with SrvRef
- Re-issue Server Certificates
- Perform Post-Recovery Tasks
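The checklist above can be kept in a form that is easy to filter by platform. The following is a minimal sketch, assuming only what this article states: the step names come from the list above, and steps 9 and 10 are flagged as NetWare-only because the notes in those sections say so. The function name and data layout are illustrative and are not part of any Novell tool.

```python
# (step name, applies to NetWare only) - names are taken from the checklist;
# the NetWare-only flags reflect the notes on steps 9 and 10 of the article.
STEPS = [
    ("Don't Panic!", False),
    ("Reconfigure Time Synchronization", False),
    ("Create a Temporary Placeholder Object", False),
    ("Use SrvRef to Replace Server References", False),
    ("Delete the Old Server Object", False),
    ("Restore Server Specific Information", False),
    ("Verify Replica Rings are Clean", False),
    ("Rebuild and Re-Patch the Crashed Server", False),
    ("Fix File System Trustee Assignments for SYS Volume", True),
    ("Remove eDirectory if Installed into Temporary Tree", True),
    ("Reconfigure Time Synchronization", False),
    ("Install Rebuilt Server into Production Tree", False),
    ("Restore Server References with SrvRef", False),
    ("Re-issue Server Certificates", False),
    ("Perform Post-Recovery Tasks", False),
]


def checklist(platform):
    """Return the numbered steps that apply to the given platform name."""
    is_netware = platform.lower() == "netware"
    return [
        f"{number}. {name}"
        for number, (name, netware_only) in enumerate(STEPS, start=1)
        if is_netware or not netware_only
    ]
```

Calling checklist("linux") keeps the original step numbers, so references such as "Step 4" elsewhere in the procedure still line up even when steps 9 and 10 are skipped.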
1. Don't Panic!
Readers familiar with the works of Douglas Adams may chuckle at this step, but it is offered as a serious part of a disaster recovery scenario, and is an absolutely critical step in the process.
Troubleshooting and disaster recovery scenarios are - without a doubt - high-stress situations. The directory is a core part of any business operation, because it allows services to authenticate users; without the directory, authentication stops working, and everything that depends on that authentication also stops working.
When a user's applications stop working, the user becomes idle, and the first instinct is often to find out why the tasks necessary for the job cannot be completed. The calls to the help desk start, and in many cases users know who is responsible for the systems in question and begin to seek out those individuals.
System administrators (both experienced and inexperienced) frequently react to this prospect by scrambling to get the system back up. The pressure that will undoubtedly be applied can create a sense of urgency to resolve the issue as quickly as possible. Unfortunately, this urgency can be intense enough that steps are missed or bad decisions are made. It is important to have someone who can run interference with the users so the administrator can concentrate on the task at hand: getting the system operational with as little downtime as possible.
Making mistakes during a disaster recovery situation increases recovery time.
2. Reconfigure Time Synchronization
Time Synchronization is often misunderstood in an eDirectory environment. eDirectory itself does not provide time services; it is a time services consumer, not a time services provider. eDirectory depends on the platform's time being correct to ensure that the timestamps applied to changes in the directory are correct.
To ensure that events in the directory are applied consistently on all servers in a replica ring, time synchronization must be properly configured.
If the server that is being recovered is a time source for any other server in the tree, time services need to be reconfigured in order to provide consistent time during the recovery.
Time synchronization differs between the platforms; each of the major platforms will be considered individually.
NetWare allows the use of two types of time synchronization: a "legacy" TIMESYNC configuration that uses a time server type (single reference, reference, primary, or secondary), and an option to use an NTP configuration.
When using a legacy TIMESYNC configuration, determine if the down server was a SINGLE, REFERENCE, or PRIMARY time server; if it was, find the servers that used the down server as a time source and reconfigure them to point to an alternate source.
With an NTP configuration on NetWare, the time synchronization configuration is set using configured sources on each server. When using this configuration, the configured time source points to the server network address and must specify port 123 as part of the time source configuration:
MAGRATHEA:set timesync time sources
TIMESYNC Time Sources: 172.16.1.1:123;
Maximum length: 149
Description: This server contacts the servers in this list as time
providers. Each time server (IP Address, DNS Name) in this list
is separated by a ';'.For example :A ";" clears the
list"MyServer;" specifies that MyServer is the NetWare time
source."MyServer:123;" specifies that MyServer is a NTP time
In this example, the time source points to 172.16.1.1:123, which indicates the configuration is using NTP.
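The port-123 convention described above can be expressed as a small check. This is a minimal sketch, assuming the value is the semicolon-separated string shown in the console output; the function name is hypothetical, and NetWare itself provides no such Python interface.

```python
def classify_time_sources(setting):
    """Split a TIMESYNC Time Sources value and label each entry.

    An entry ending in ":123" is treated as an NTP source; any other
    entry is treated as a legacy TIMESYNC (NetWare) source.
    """
    results = []
    for entry in setting.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # a trailing ';' or a bare ';' yields an empty entry
        host, separator, port = entry.rpartition(":")
        if separator and port == "123":
            results.append((host, "NTP"))
        else:
            results.append((entry, "TIMESYNC"))
    return results
```

Run against the values collected from each server, a check like this makes it easier to list which servers still point at the down server and which protocol each one expects from its replacement source.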
The Linux and Unix platforms use very similar configurations; time synchronization is typically implemented using NTP.
On SUSE Linux Enterprise Server 9, for example, the configuration is stored in /etc/ntp.conf. Changing the time synchronization configuration in this environment involves modifying this file's "server" line:
(File clipped to show just the relevant section that needs to be changed)
## Outside source of synchronized time
## server xx.xx.xx.xx # IP address of server
Configuration is similar on other Linux and Unix platforms.
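Where many servers need the same change, the edit to the "server" line can be scripted. The following is a minimal sketch that works on the text of an ntp.conf file; the function name is hypothetical, 172.16.1.1 is the example address used in this article, and any replacement address is an assumption for illustration. It only touches active "server" lines that name the old source and leaves comments and other directives alone.

```python
def repoint_ntp_servers(conf_text, old_source, new_source):
    """Return ntp.conf text with 'server old_source' lines repointed."""
    output = []
    for line in conf_text.splitlines():
        fields = line.split()
        # An active directive has "server" as its first token and the
        # address as its second; commented lines start with '#'.
        if len(fields) >= 2 and fields[0] == "server" and fields[1] == old_source:
            line = line.replace(old_source, new_source, 1)
        output.append(line)
    return "\n".join(output) + "\n"
```

The function returns the new text rather than writing the file, so the result can be reviewed before it replaces /etc/ntp.conf. The NTP daemon still has to be restarted by whatever means the platform provides for the change to take effect.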
The Windows platforms use a subset of NTP called SNTP (for "Simple Network Time Protocol"). This time protocol is compatible with NTP servers, so Windows servers can participate in the same time synchronization environment.
|NOTE: Windows time services are configured within a domain environment as well; designate one domain controller to receive time externally and allow Windows' time configuration to take care of the rest of your domain or forest if using a domain configuration.|
To change the time source in Windows, enter:
NET TIME /SETSNTP:172.16.1.1
This will change the time source to point at server 172.16.1.1.
Then restart the W32TIME service:
NET STOP W32TIME
NET START W32TIME
3. Create a Temporary Placeholder Object
The concept of Referential Integrity refers to the ability to properly track links between objects in the directory. For example, if a user is a member of a group, it is important that that reference to the group object be maintained regardless of where the group is located in the directory or what its name is. eDirectory maintains this information by using a piece of data (the "entry ID" or EID) that is not tied to the object's name or location in the directory.
Think of the EID as being similar to a database's primary key for a row of data: if the row of data is deleted, the key is no longer valid, and the reference is invalidated.
When an object is deleted from the directory, the references to that object become invalid and are removed from the objects that contain the reference. In some cases, this creates a slight inconvenience, but in others (e.g., volume objects), the loss of the server object in the tree will result in a schema class violation, and the object will mutate into an unknown object.
To prevent this from happening, change the references to point to a temporary object. Because we want something that has no existing references to it, a new object must be created as a referrer before deleting the server object.
|WARNING: Do not point to an existing object! Existing objects will likely already be referenced, and it is not possible to distinguish which references were in place before the replacement was done.|
The class of the new object does not matter from a directory standpoint; the SrvRef utility uses a computer object for the replacement. This object class is a good choice, as it is rarely used in eDirectory implementations. ZENworks uses a different object class - the workstation class - for its purposes.
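The reason for using a brand-new placeholder can be seen in a toy model. The following is a minimal sketch, not a description of how SrvRef or eDirectory is implemented: the directory is reduced to a dictionary of objects and the EIDs they reference, and every name and EID value is hypothetical.

```python
SERVER_EID = 0x8001       # hypothetical EID of the crashed server object
PLACEHOLDER_EID = 0x9001  # hypothetical EID of a freshly created computer object
EXISTING_EID = 0x8002     # hypothetical EID of an object that is already referenced
NEW_SERVER_EID = 0x8003   # hypothetical EID of the reinstalled server object


def replace_references(directory, old_eid, new_eid):
    """Point every reference to old_eid at new_eid; return how many changed."""
    changed = 0
    for references in directory.values():
        for index, target in enumerate(references):
            if target == old_eid:
                references[index] = new_eid
                changed += 1
    return changed


def recover(stand_in_eid):
    """Replace the server references with stand_in_eid, then restore them."""
    directory = {
        "FS1_SYS": [SERVER_EID],
        "Queue1": [SERVER_EID],
        "Admins": [EXISTING_EID],  # unrelated reference that must survive
    }
    replace_references(directory, SERVER_EID, stand_in_eid)
    # The old server object is deleted and the server is reinstalled here.
    replace_references(directory, stand_in_eid, NEW_SERVER_EID)
    return directory
```

With recover(PLACEHOLDER_EID), the "Admins" entry still points at EXISTING_EID afterwards. With recover(EXISTING_EID), the restore pass cannot tell the original reference from the temporary ones, so "Admins" ends up pointing at the server.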
4. Use SrvRef to Replace Server References
When preparing to present this topic at BrainShare 2001, Peter Kuo realized there was a need to perform a search and replace function on references in the directory. This process used to be part of the DSMAINT -PSE procedure (documented in TID10013535 (http://support.novell.com/cgi-bin/search/searchtid.cgi?/10013535.htm)) in the NetWare 4.11 days. The DSMAINT utility was not updated when NetWare 5 was released, but another tool called XBROWSE was created. XBROWSE was known to have some issues, which were noted in TID10013535.
Peter created the SrvRef (ftp://ftp.dreamlan.com/srvref.zip) utility for the purposes of the BrainShare presentation and makes it available for free.
When running the SrvRef utility, be sure you are authenticated to the tree; as shown in the screenshot, it uses the credentials for the logged-in user - in this case, the client was not authenticated to the tree.
To use the utility:
- Select "Replace Reference" from the listbox.
- Click the top "..." button to browse for the server that is being deleted.
- Click the middle "..." button to browse for the placeholder computer object.
- Click the bottom "..." button to select the container to start the search and replace operation from.
Normally, you should leave the "Scan subtree" option enabled. In larger trees, consider disabling it and running the utility against multiple containers to speed up the search.
5. Delete the Old Server Object
Once the server references have been replaced, you can safely delete the server object. This is necessary in order to allow the server to be reinstalled into the tree as if it were a new server.
6. Restore Server Specific Information
If the crashed server is a NetWare server, some additional documentation about the server's configuration is available for recovery. With the TSA used for filesystem backup, there is a target called Server Specific Information (or SSI) that contains the following files:
TID10062402 (http://support.novell.com/cgi-bin/search/searchtid.cgi?/10062402.htm) describes the contents of each of these files. For this procedure, use all of these files except for the SERVDATA.NDS file.
When restored, these files will be placed into a directory called SYS:SYSTEM\<servername>. For example, if the server's name is FS1, the path this server's SSI files will be restored to will be SYS:SYSTEM\FS1, regardless of the server to which data is restored.
7. Verify Replica Rings are Clean
Using DSREPAIR - and, if available, the DSMISC.LOG from the SSI restore - verify that the replica rings are healthy. As cited in the article Using the DSREPAIR Utility Appropriately (http://www.novell.com/coolsolutions/feature/15312.html), this is one of the circumstances where DSREPAIR should be used.
Correct any replica ring inconsistencies using DSREPAIR with the commands listed in the table below:
|NetWare||Windows||Linux/Unix|
|DSREPAIR -A||Start DSREPAIR.DLM with the -A switch||ndsrepair -P -Ad|
Clearing the replica ring inconsistencies involves forcibly removing the server from the replica ring - this should be handled by step 5, but verification is very important. If the server needs to be removed, view the servers in the replica ring(s) affected, and select the option to remove the server from the ring. If the crashed server held the master replica for any partition, use the DSREPAIR utility to designate a new server as the master for the ring; after the recovery is complete, use the standard administration tools (not DSREPAIR) to move the master to the crashed server. Remember that DSREPAIR should be used for this type of operation only when the standard utilities fail to do so, and only after diagnosing the problem.
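The conditions being verified here can be written down as a simple consistency check. The following is a conceptual sketch only: it does not parse DSREPAIR output or DSMISC.LOG, and it assumes the replica ring views have already been transcribed by hand into dictionaries. The server names FS2 and FS3 used in the usage note and the function name are hypothetical.

```python
def check_ring(views, crashed_server):
    """Return a list of problems found in one partition's replica ring.

    views maps each surviving server to its own view of the ring,
    given as {server_name: replica_type}.
    """
    problems = []
    reference = None
    for reporter in sorted(views):
        ring = views[reporter]
        if crashed_server in ring:
            problems.append(f"{reporter} still lists {crashed_server}")
        # Only a master held by a surviving server is usable.
        masters = [name for name, replica_type in ring.items()
                   if replica_type == "master" and name != crashed_server]
        if len(masters) != 1:
            problems.append(
                f"{reporter} sees {len(masters)} surviving master replicas")
        if reference is None:
            reference = ring
        elif ring != reference:
            problems.append(f"{reporter} disagrees with the first view checked")
    return problems
```

For a ring where FS2 and FS3 both report FS2 as master and FS3 as read/write, check_ring returns an empty list for crashed server FS1. If FS2 still lists FS1 as the master, the function reports the stale entry, the missing master, and the disagreement between the two views, which correspond to the cases this step corrects with DSREPAIR.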
8. Rebuild and Re-Patch the Crashed Server
Rebuild the server using configuration information restored from SSI or as found in the documentation.
|NOTE: For NetWare servers only, the server must be installed into a temporary tree in order to complete the installation. This step is only necessary on NetWare because the eDirectory installation is an integral component of the OS installation; with the other platforms, patches can be applied prior to the eDirectory configuration.|
This rebuild should include all patches that were on the system prior to the crash.
|WARNING: The server must be rebuilt utilizing the configuration in use prior to the server crash. This is not the time to change IP addresses, rename the server, or perform other maintenance tasks.|
9. Fix File System Trustee Assignments for SYS Volume
|NOTE: This step applies to NetWare servers only.|
Trustee assignments on Traditional File System (TFS) and NetWare 5's version of NSS used the EID of objects stored locally on the server. If the object did not exist in a local replica, an external reference was created so there would be an EID on the local server for this purpose.
When rebuilding a server like this, any pre-existing filesystem trustee rights will be assigned to EIDs set up for [Public], [Root], and other local IDs. When the DS is removed from the server in the next step, those file system trustees are orphaned. A reinstall of the directory can cause unusual filesystem trustee assignments to appear, because different EIDs are used for those objects. This makes it necessary to delete the existing trustee assignments.
With NSS on NetWare 6 and later, the GUID attribute is used by the filesystem to hold the trustee assignment information. However, the GUIDs will no longer be valid (or will be assigned to other objects in the tree) because this server was installed into a temporary tree, so even with filesystem rights managed by GUID, this step is still necessary.
If recovering a lost SYS volume, LOAD DSREPAIR -XK6 and select "Check Volume Objects and Trustees" in the advanced menu. When prompted to make the change on the SYS volume, answer no. For all other volumes, answer yes.
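Why stale trustee assignments are dangerous rather than merely useless can be shown with a toy model. This is a minimal sketch, assuming trustee assignments are stored as lists of EIDs per path and that each installation of the directory has its own EID table; the paths, names, and EID values are all hypothetical, and the real on-disk formats are not modeled.

```python
def resolve_trustees(trustees, eid_table):
    """Map each path's trustee EIDs to names using one server's EID table."""
    return {
        path: [eid_table.get(eid, "<orphaned>") for eid in eids]
        for path, eids in trustees.items()
    }


# Trustee assignments as left on the volume (hypothetical values).
TRUSTEES = {"SYS:PUBLIC": [0x8010], "SYS:SYSTEM": [0x8011]}

# EID tables before the crash and after the reinstall (hypothetical values).
EIDS_BEFORE = {0x8010: "jsmith", 0x8011: "Admins"}
EIDS_AFTER = {0x8010: "[Public]", 0x8020: "jsmith"}
```

Resolved against EIDS_BEFORE, the assignments name jsmith and Admins. Resolved against EIDS_AFTER, the same stored EIDs grant the SYS:PUBLIC assignment to [Public] and leave the SYS:SYSTEM assignment orphaned, which is the kind of unusual assignment this step removes.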
10. Remove eDirectory if Installed into Temporary Tree
|NOTE: This step applies to NetWare only.|
In order to reinstall the server into the production tree, it is necessary to remove the server from its temporary tree.
11. Reconfigure Time Synchronization
If in Step 2 it was necessary to make changes to the environment's time synchronization configuration, this is a good time to set the configuration as it was prior to the start of recovery. This is not a required step, but one goal of a DR procedure should be to restore the initial configuration.
12. Install Rebuilt Server into Production Tree
At this point, everything should be ready for the server's re-installation into the tree. Perform one last health check on the production tree using the procedures in the article Using iMonitor to Perform eDirectory Health Checks (http://www.novell.com/coolsolutions/feature/15336.html) or TID10060600 (http://support.novell.com/cgi-bin/search/searchtid.cgi?/10060600.htm) if iMonitor is not available in the environment. The iMonitor method is much faster - especially in a large environment - so time used to become familiar with the method is time well spent.
13. Restore Server References with SrvRef
The final step in the process of recovering the base server and eDirectory is to restore the server references that were replaced in Step 4 using the SrvRef utility. This process is performed in exactly the same way as the original replacement, but with the "Restore reference" option selected from the listbox. Be sure to perform the restore operation everywhere references were replaced: if you performed multiple replace operations in different parts of your tree, be sure all of the references have been restored.
Once this step is complete, delete the temporary computer object that was used by SrvRef.
14. Re-issue Server Certificates
Server certificates are created in the process of a normal installation of eDirectory. In some cases, this step may be unnecessary, but you should verify that the certificates that exist for the server are valid.
If the server being recovered is the certificate authority for the tree, this step can be much more involved:
- Delete the existing CA
- Create a new CA
- Re-issue all certificates in the tree
The last step can be done over time - the certificates will not be invalidated by the CA being replaced, but the certificates will be non-verifiable because the signing CA no longer exists. Eventually, they should be replaced with freshly issued certificates.
If the server is not the CA for the tree, issue new certificates only for the server itself. The SSL CertificateIP and SSL CertificateDNS certificates will likely need to be reissued.
15. Perform Post-Recovery Tasks
Now that the recovery of the eDirectory server is complete, the total recovery of the server can be concluded by performing the following tasks:
- Restore Data and Trustee Information
- Re-install Server-based Applications
- Re-establish Replica Information
Restore Data and Trustee Information
When restoring data on a NetWare server, most backup software offers the option of a data restore or a trustee restore. For the SYS volume, a trustee restore is recommended unless there is application data that also needs to be restored. For other volumes, the choice depends on whether the volume had to be recovered. For example, if only the SYS volume on a server was lost, there is no need to restore the data for the other volumes, but you may wish to restore trustee information, depending on whether the trustee assignments were preserved properly. With NetWare 6 or later, chances are good that the trustee information is fine, as that information is based on object GUIDs rather than EIDs. If the trustees appear not to have recovered properly, use the DSREPAIR -XK6 option outlined in step 9 on each affected volume and then perform a trustee-only restore.
Re-install Server-based Applications
If there are server-based applications installed on the server - anti-virus, backup, firewall, or others - restore these as well. When this step is performed depends on the criticality of the application to the business; less important applications may be able to wait until after data is restored. More important applications may need to be installed as soon as possible.
Re-establish Replica Information
Using iManager, ConsoleOne, or NDS Manager and information from the DSMISC.LOG file from SSI (if this is available), replace the replicas that were on the server and set the replica types as appropriate. This step can take some time to complete and may impact performance of the server; it may be advisable to do this after hours, depending on the number of replicas and their sizes.