Ldif2dib - Offline Bulkload Tool for eDirectory
Novell Cool Solutions: AppNote
By Hiredesai Santosh, Piyush Janawadkar
Posted: 26 Jul 2006
One of the first steps in a directory deployment is to populate the directory with all the objects and identities that will later be used by directory-enabled enterprise applications. This makes bulkload speed a key performance indicator for any directory, alongside factors such as search performance and reliability. A long-drawn-out bulkload process can seriously prolong the schedule of a directory deployment or IDM (Identity Management) project. This article describes a new bulkload utility for eDirectory, christened "ldif2dib", which is intended to be a silver bullet for these problems.
Bulkload in Earlier versions of eDirectory
The Internet-standard LDAP protocol, which is the primary means of accessing directories, provides the add operation to create a new entry in the directory. However, tools such as ldapmodify and ldapadd, which use the add operation to load entries into the directory from an LDIF file, perform poorly when a large number of objects must be loaded in a limited amount of time, and the problem worsens as the number of objects grows. The main performance inhibitor is that these tools process the LDIF entries sequentially.
Figure 1: Using LDAP tools to add objects to eDirectory
To overcome slow performance, most directories provide special tools tuned to bulkload a large number of objects at very high speeds. In the past, eDirectory has been equipped with different bulkload tools. Ndsbulkload, provided until eDirectory 8.5, populated eDirectory (then NDS eDirectory) from an LDIF file, using NDAP (Novell Directory Access Protocol) to load the users into eDirectory.
Ndsbulkload was replaced by ICE (Import Convert Export), a multi-purpose tool with additional functions. ICE provides a bulkload operation over LBURP (LDAP Bulk Update/Replication Protocol, http://www.ietf.org/rfc/rfc4373.txt). To this day, it remains the preferred tool for bulkloading a large number of objects into eDirectory. Visit http://www.novell.com/documentation/edir88/index.html?page=/documentation/edir88/edir88/data/a5hf8rg.html for more information on the ICE utility.
Let's elaborate a little more on the ICE modus operandi. ICE makes an LDAP request to the server for the transaction size. The transaction size, a configurable parameter on the server, gives the number of objects the server can process in a single operation. Once the ICE client has this information, it forms LBURP packets, each containing as many objects as the transaction size permits. The client then sends them to the server asynchronously. This improves performance for two reasons:
- With asynchronous calls the ICE client need not wait for the server response to trigger the processing.
- A chunk of entries is sent to the server in one shot, which eliminates the overhead of the entry-by-entry sequential processing employed by ldapmodify or ldapadd.
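The batching idea described above can be illustrated with a minimal sketch. It is not ICE's implementation: the function name `batch_entries`, the sample DNs, and the transaction size of 3 are hypothetical, and the sketch only shows how a stream of entries might be grouped into transaction-sized chunks before being sent.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_entries(entries: Iterable[str], transaction_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `transaction_size` entries, preserving order."""
    if transaction_size <= 0:
        raise ValueError("transaction_size must be positive")
    iterator = iter(entries)
    while True:
        # Take the next chunk; an empty chunk means the input is exhausted.
        batch = list(islice(iterator, transaction_size))
        if not batch:
            return
        yield batch


# Hypothetical data: seven entry DNs and a transaction size of 3.
entries = [f"cn=user{i},o=example" for i in range(7)]
batches = list(batch_entries(entries, 3))
print([len(b) for b in batches])  # prints [3, 3, 1]
```

Each yielded list stands in for the payload of one bulk packet; a real client would also need to handle the asynchronous sends and server responses.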
Figure 2: Using ICE to add objects to eDirectory
Though ICE sizeably outperforms ldapmodify/ldapadd, its performance is limited by the following factors:
- The LDAP server must perform several tasks before adding an object to the DIB, such as authorization checks using ACLs to ensure the user adding the objects has sufficient rights for the operation, and schema checking to ensure the data being imported is valid according to the directory schema.
- The data must be converted from the LDAP format (UTF-8) to the NDS format, which uses Unicode, and LDAP schema names must be mapped to NDS schema names.
- Because ICE is an LDAP client, the performance of the underlying network comes into the picture.
ldif2dib: The new bulk load tool
As directory usage has become more high-scale, it's common to see large directory deployments in the enterprise and carrier grade market that need to grow to millions of objects. Because existing eDirectory bulkload tools (ldapmodify, ldapadd, ICE) take a long time to load millions of objects, it was necessary to add a high-speed tool specially designed for loading millions of objects in a short period of time.
The new ldif2dib bulkload tool was designed to avoid some of the limitations mentioned above and import the objects directly to the database. ldif2dib, as the name suggests, is a utility that reads data from the LDIF file and writes directly to the Dib (Directory Information Base), bypassing major processing overhead on the server.
Figure 3: Direct access from ldif2dib to the eDirectory database
This tool works in an offline mode, which implies that the eDirectory server should be shut down during the bulkload operation. The utility reads the objects from an LDIF file and creates database entries by populating operational and user-specified attributes. The utility uses a multi-threaded loading engine, thereby enabling parallel reading of the LDIF entries and their eventual population in the Dib. The utility provides several command-line options to alter the bulkload behavior, as well as a UI that displays the progress of the bulkload operation and the associated parameters. The utility is supported on Linux, Solaris, AIX and Windows in the eDirectory 8.8 SP1 release.
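As a rough illustration of the reader-plus-worker-threads pattern described above, here is a minimal sketch. It is not the ldif2dib engine: the sample LDIF, the function names, and the in-memory dictionary standing in for the database are all hypothetical, and the parser ignores LDIF line folding, comments, and base64 values.

```python
import queue
import threading

SAMPLE_LDIF = """\
dn: cn=user1,o=example
objectClass: inetOrgPerson
cn: user1
sn: One

dn: cn=user2,o=example
objectClass: inetOrgPerson
cn: user2
sn: Two
"""


def split_entries(ldif_text):
    """Split LDIF text into per-entry blocks separated by blank lines."""
    return [block for block in ldif_text.strip().split("\n\n") if block.strip()]


def parse_entry(block):
    """Parse 'attribute: value' lines into a dict of value lists."""
    entry = {}
    for line in block.splitlines():
        name, _, value = line.partition(":")
        entry.setdefault(name.strip(), []).append(value.strip())
    return entry


def load(ldif_text, worker_count=2):
    """Parse entries on worker threads and collect them in an in-memory store."""
    work = queue.Queue()
    store = {}
    lock = threading.Lock()

    def worker():
        while True:
            block = work.get()
            if block is None:  # sentinel: no more work for this thread
                return
            entry = parse_entry(block)
            with lock:  # serialize writes to the shared store
                store[entry["dn"][0]] = entry

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for block in split_entries(ldif_text):
        work.put(block)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return store


store = load(SAMPLE_LDIF)
print(sorted(store))  # prints ['cn=user1,o=example', 'cn=user2,o=example']
```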
Figure 4: Command line options of ldif2dib
Figure 5: ldif2dib UI during the bulkload
Interplay of Database Tunables with Bulkload
One of the important variables that controls the speed of the bulkload operation is the database cache. For a detailed treatise on the database cache in eDirectory, refer to:
There is usually a positive correlation between the cache available for bulkload (this can be configured using the -c command line option) and the speed of loading. This is because higher cache values imply that the entries being loaded could be held in the database cache instead of being written to the disk intermittently. Intermittent disk I/O (as opposed to once at the end of the bulkload operation) can be expensive, depending on the file system and physical storage device used.
The cache value provided via the -c option is pre-allocated by ldif2dib. If the -c option is not used, ldif2dib acquires memory for the cache dynamically. The block cache holds in-memory images of the data blocks present on disk, while the entry cache holds the logical representation of directory objects. A larger block cache speeds up adds, while a larger entry cache serves retrievals and reads better. Because adding entries relies more on the block cache than on the entry cache, configuring a high block cache percentage (up to 90%, using the -p switch) is beneficial.
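To make the split concrete, the following sketch works out how a total cache size divides between block cache and entry cache for a given block cache percentage. The helper `split_cache` is hypothetical, and treating 1.5 GB as 1536 MB is an assumption; the 90% figure matches the setting used in the tests reported later in this article.

```python
from typing import Tuple


def split_cache(total_mb: int, block_cache_percent: int) -> Tuple[int, int]:
    """Return (block_cache_mb, entry_cache_mb) for a total cache size in MB."""
    if not 0 <= block_cache_percent <= 100:
        raise ValueError("block_cache_percent must be between 0 and 100")
    block_mb = total_mb * block_cache_percent // 100  # rounded down to whole MB
    entry_mb = total_mb - block_mb                    # the remainder
    return block_mb, entry_mb


# Assumption: 1.5 GB is taken as 1536 MB.
print(split_cache(1536, 90))  # prints (1382, 154)
```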
The eDirectory database performs periodic checkpoints to minimize recovery time after a system crash. The checkpoint interval is the time the database waits before it initiates the checkpoint background thread. This brings the on-disk version of the database to the same coherent state as the in-memory (cached) database. This also helps in recovering the database from a crash to the last consistent state. The checkpoint thread flushes the dirty cache to the disk, followed by cleaning up the roll forward log. This interval is configurable through the -i switch. However, it is not recommended to indefinitely postpone running the checkpoint, as the database roll-forward log could grow to a large value and hit the 4 GB limit, causing ldif2dib to error out.
Transaction size is the number of objects added to the database in one commit operation. Smaller transaction sizes lead to frequent commits, which means frequent disk accesses to write out the committed entries. Therefore, larger transaction sizes are recommended for better performance. The ldif2dib -t option specifies the transaction size. A value of 0 means the commit is done only once, after the entire LDIF file is loaded; this has been found to give the best performance. However, when bulkloading a large number of entries, an extremely large transaction size is not recommended: if the ldif2dib process crashes, all data loaded but not yet committed before the crash is lost.
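The trade-off can be expressed with simple arithmetic. The sketch below is a simplified model, not a description of ldif2dib internals: the function names are hypothetical, and the model assumes that exactly the entries of the current uncommitted transaction are at risk if the process crashes.

```python
def commit_count(total_entries: int, transaction_size: int) -> int:
    """Number of commits needed; a transaction size of 0 means one commit at the end."""
    if total_entries <= 0:
        return 0
    if transaction_size == 0:
        return 1
    return (total_entries + transaction_size - 1) // transaction_size  # ceiling division


def max_entries_at_risk(total_entries: int, transaction_size: int) -> int:
    """Upper bound on uncommitted entries that a crash could discard (simplified model)."""
    if transaction_size == 0:
        return total_entries
    return min(transaction_size, total_entries)


for size in (250, 5000, 0):
    print(size, commit_count(1_000_000, size), max_entries_at_risk(1_000_000, size))
# prints:
# 250 4000 250
# 5000 200 5000
# 0 1 1000000
```

Under this model, raising the transaction size from 250 to 5000 cuts the number of commits for one million entries from 4000 to 200, while a value of 0 commits once but leaves the whole load exposed until the end.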
ldif2dib and Replication
eDirectory replication works on the basis of a change cache, which is the set of objects modified or added since the last successful sync point. This set of cached objects and the related changes is sent to the servers in the replica ring so that they can be updated. Refer to the eDirectory Administration Guide section "Server Synchronization in Replica Ring" for details on eDirectory replication:
The ldif2dib -r command-line switch populates the change cache as it adds new objects to the Dib. Having the change cache ready speeds up replication. Otherwise, the change cache would take time to build after the server is started on the bulkloaded database.
ldif2dib supports the following password imports:
- NMAS Simple Password: Includes clear text passwords or hashed passwords (SSHA, SHA1, MD5, crypt). This requires that the NMAS Simple Password method be installed before the import.
- NDS Password: Storing this password involves generating and storing an RSA key pair (public and private keys). Generating the key pair is expensive, so this option is slower than the other password types. The -w switch generates the NDS password for the User Password attribute.
- Universal Password (UP): Universal passwords are designed to address the limitations of NDS Password and Simple Password. Visit http://www.novell.com/documentation/nmas23/index.html?page=/documentation/nmas23/admin/data/allq21t.html for more information on Universal Password.
To use ldif2dib with UP, a UP policy must be defined. UP policy provides options to synchronize Universal Password with NDS Password and Simple Password. With password synchronization turned on, setting UP sets both the passwords. Once UP is enabled and set, the Simple Password login and NDS login methods of NMAS always use UP. The administrator can define password policies and set the type of characters that are allowed in the password. Some of the policies available with UP are:
- Minimum/maximum characters
- Repeatable/consecutive characters
- Exclude list
- Expiration settings
- Numeric/special characters, such as !@#$%^`&*()
- Requirement for unique passwords
- Forgotten passwords
Processing objects with a User Password is expected to be slow because of the extra encryption or key-pair generation. These are computation-expensive operations, as compared to the processing of any other attributes.
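For readers preparing hashed passwords for an LDIF file, the sketch below shows the widely used salted SHA-1 layout commonly labelled {SSHA}: the Base64 encoding of the SHA-1 digest of password plus salt, followed by the salt. Whether ldif2dib expects exactly this string format in the LDIF file is an assumption to verify against the eDirectory documentation; the function names and the 8-byte salt length are illustrative choices.

```python
import base64
import hashlib
import hmac
import os
from typing import Optional

PREFIX = "{SSHA}"


def ssha_hash(password: str, salt: Optional[bytes] = None) -> str:
    """Return '{SSHA}' + base64(sha1(password + salt) + salt)."""
    if salt is None:
        salt = os.urandom(8)  # assumption: 8 random bytes of salt
    digest = hashlib.sha1(password.encode("utf-8") + salt).digest()
    return PREFIX + base64.b64encode(digest + salt).decode("ascii")


def ssha_verify(password: str, stored: str) -> bool:
    """Check a password against a stored {SSHA} value."""
    if not stored.startswith(PREFIX):
        return False
    raw = base64.b64decode(stored[len(PREFIX):])
    digest, salt = raw[:20], raw[20:]  # a SHA-1 digest is 20 bytes long
    candidate = hashlib.sha1(password.encode("utf-8") + salt).digest()
    return hmac.compare_digest(candidate, digest)


hashed = ssha_hash("secret")
print(ssha_verify("secret", hashed), ssha_verify("wrong", hashed))  # prints True False
```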
Indexes are used by eDirectory to improve the search performance. Indexes are dynamically updated when objects are added, modified, or deleted. An index by itself is a set of keys in a sorted order, referring to the objects that contain indexed attributes. Bulkloading an indexed database proves very costly because of the overhead in updating the indexes. ldif2dib provides a command line switch to suspend the indexing during bulkload and resume it once the bulkload is complete. After the eDirectory server is started on the bulkloaded database, a background thread accomplishes the job of updating the indexes. This improves performance of the bulkload, as the overhead of updating the indexes is eliminated until the end of bulkload operation.
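The benefit of suspending indexing can be illustrated with a toy model. This is not how eDirectory stores its indexes: the sorted Python list, the function names, and the sample surnames are hypothetical stand-ins. The sketch contrasts updating a sorted index on every add with building it once after all entries are loaded; both approaches end with the same index.

```python
import bisect


def load_with_live_index(values):
    """Keep a sorted (value, entry_id) index up to date on every add."""
    index = []
    for entry_id, value in enumerate(values):
        bisect.insort(index, (value, entry_id))  # one index update per entry
    return index


def load_with_deferred_index(values):
    """Add all entries first, then build the sorted index in a single pass."""
    entries = list(enumerate(values))
    return sorted((value, entry_id) for entry_id, value in entries)


surnames = ["Smith", "Adams", "Young", "Baker", "Adams"]
live = load_with_live_index(surnames)
deferred = load_with_deferred_index(surnames)
print(live == deferred)  # prints True
print(deferred[0])       # prints ('Adams', 1)
```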
Super Performance Numbers
This section discusses the results of the bulkload performance test conducted in a Novell lab. A SLES 9 server was used, running on a two-processor 2.4 GHz system with 2 GB RAM and an IBM-ESXS Model CBR146C34810ESFN hard disk.
| Tool | Time taken (HH:MM:SS) for a 1 million entry import without passwords | Entries added per second | Tuning parameters |
| ldapadd/ldapmodify | 12:30:45 | 22.2 | 1.5 GB cache, block cache 90% |
| ICE | 2:11:16 | 127.0 | 1.5 GB cache, block cache 90%, transaction size 250, writer threads 3 |
| ldif2dib | 0:22:47 | 731.5 | 1.5 GB cache, transaction size 5000, block cache 90% |
Table 1: ldif2dib import time in comparison with ldap tools and ICE
As can be seen, ldif2dib bulkloads roughly 5.8 times faster than ICE and about 33 times faster than ldapmodify/ldapadd.
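The rates in Table 1 follow directly from the elapsed times. The short sketch below recomputes the entries-per-second figures and the speedup of ldif2dib over the other tools from the times reported in the table; the variable and function names are illustrative.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


ENTRIES = 1_000_000
elapsed = {
    "ldapadd/ldapmodify": to_seconds("12:30:45"),
    "ICE": to_seconds("2:11:16"),
    "ldif2dib": to_seconds("0:22:47"),
}

rates = {tool: ENTRIES / secs for tool, secs in elapsed.items()}
for tool, rate in rates.items():
    print(f"{tool}: {rate:.1f} entries/s")
# prints:
# ldapadd/ldapmodify: 22.2 entries/s
# ICE: 127.0 entries/s
# ldif2dib: 731.5 entries/s

speedup_vs_ice = elapsed["ICE"] / elapsed["ldif2dib"]
speedup_vs_ldap = elapsed["ldapadd/ldapmodify"] / elapsed["ldif2dib"]
print(f"{speedup_vs_ice:.1f}x, {speedup_vs_ldap:.1f}x")  # prints 5.8x, 33.0x
```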
| LDIF size | Password (Yes/No) | Indexing (On/Off) | ObjectClass | Time taken (HH:MM:SS) | Entries added per second |
Table 2: ldif2dib import times with different configurable parameters and ldif files
The above table shows the effect of password import and indexing on the speed of the bulkload using ldif2dib.
Limitations of ldif2dib
The superior performance of ldif2dib comes with a few corners cut. Here are the limitations:
- Because the tool operates in offline mode, the server must be brought down during the bulkload, which results in some server downtime.
- Because ldif2dib performs only minimal schema checking, it is the administrator's responsibility to ensure the data in the LDIF file meets the schema requirements.
- Since no explicit authentication is required, anyone having file level access to the Dib can use this utility.
- ldif2dib must run on the local machine where the Dib is present.
- As of this writing, the tool is not available on the NetWare platform.
As the performance numbers indicate, ldif2dib is the sledgehammer in the collection of eDirectory bulkload tools. With ldif2dib, long bulkload times should be a thing of the past in eDirectory deployments. In IDM deployments, the initial synchronization time into eDirectory from another data source can be reduced by exporting the data in LDIF format and adding it to eDirectory using ldif2dib.