Cleaning Your Directory of Elusive Obits
Novell Cool Solutions: Trench
By David Gersic
Posted: 18 Sep 2002
After reading David's post about stuck obits on the support forums, we contacted him and asked for his view "from the trenches". He was gracious enough to provide a full explanation. That's why we love our readers!
In a nutshell
Duplicate Creation Timestamps on objects can cause stuck obits with eDir 8.5 (DS.NLM v85.25z). The fun part is that you can't find them with DSRepair, as it doesn't seem to care about this anymore. But, if the obit has the same timestamp as a real object, the obit won't clear. I found several of these in our DS, including a couple of those Type 000C Flags 0002 (OK_TO_PURGE) ones that wouldn't go away on their own.
So, if you're trying to figure out why an obit won't purge, here's a new tip. Use DSBrowse to search for objects with the same Creation Timestamp as the obit object that is stuck. As far as I can tell, the only solution is to delete all of the objects involved in the conflict.
Other fun symptoms include rename collision (1_6, etc.) objects being created as DS tries to figure out what to do with this mess.
Stuck obituaries can happen for a variety of reasons. Most of them are pretty ordinary, and can be resolved by the standard methods of shifting the Master replica around the replica ring, using dsrepair -ot to re-time-stamp them to get them to process, and using dstrace to force the background processes to run. How best to resolve a stuck obit depends on the version of DS being used, and on the available options for that DS and DSRepair version. There's a TID on this, 10062149, that covers most ways to get rid of stuck obits and is generally pretty good.
What's not covered
Going back to DS historical information, though, there is a case that this TID does not cover, and that's the one I ran into. With DS versions prior to DS8.x and 85.x (SKADS, TAO), there are two Really Important things about any one DS object: the Object ID number, and the Creation Timestamp.
The Object ID number is a server-specific 32-bit number that identifies the entry in the local DS database where this object's information can be found. These DS versions use what is known as the "RecMan" (record manager) database, and the Object ID is just the database record number (24-bit) combined with a use counter (8-bit) to ensure unique object IDs.
The Creation Timestamp (timestamp) is a property of the DS object, stored in the database, and replicated to the other servers that hold copies of this object. Internally, it is used to uniquely identify the object for all servers. The timestamp is guaranteed unique across all servers by combining three pieces of information. The first is the date/time the object was created, the second is the replica number (i.e., which server issued the create for the object), and the third is an event counter that the server increments each time it processes any DS operation within a single second.
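To make the 24-bit/8-bit split concrete, here is a minimal sketch of how such an Object ID could be packed and unpacked. The placement of the use counter in the high byte and the function names are assumptions made for illustration only; nothing here is taken from the actual RecMan implementation.

```python
# Illustrative only: the bit layout (use counter in the top 8 bits) is an assumption.
RECORD_BITS = 24
USE_COUNT_BITS = 8
RECORD_MASK = (1 << RECORD_BITS) - 1


def pack_object_id(record_number: int, use_count: int) -> int:
    """Combine a 24-bit record number and an 8-bit use counter into a 32-bit ID."""
    if not 0 <= record_number <= RECORD_MASK:
        raise ValueError("record number must fit in 24 bits")
    if not 0 <= use_count < (1 << USE_COUNT_BITS):
        raise ValueError("use counter must fit in 8 bits")
    return (use_count << RECORD_BITS) | record_number


def unpack_object_id(object_id: int) -> tuple[int, int]:
    """Split a 32-bit ID back into (record_number, use_count)."""
    return object_id & RECORD_MASK, object_id >> RECORD_BITS


# The same record slot reused a second time yields a different Object ID.
first_use = pack_object_id(0x001234, 1)
second_use = pack_object_id(0x001234, 2)
```

The point of the use counter is visible in the last two lines: when a record slot is reused, the counter changes, so the old and new objects do not share an ID.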
So, the creation timestamp for an object would be something like "14 Aug 2002 09:15:42" plus a replica number and an event count. Translated, such a timestamp might say that this object was created 14 Aug 2002, at 9:15:42am, by the server that holds replica number 1 of the partition that this object exists in, and the create was the 19th DS transaction that the server processed that second. This timestamp is replicated to all servers that hold a copy of this object, so it can be used by the servers to uniquely identify that object later. In theory, no two objects can ever have the same timestamp, since that would mean that one server did two things at the same time, within the same event. But, there are some bugs that can lead to two objects having the same timestamp, which is where things get weird.
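The three-part timestamp described above can be modelled as a simple tuple. This is a rough sketch, not the on-disk format: the class name, field names, and the use of UTC seconds are all assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CreationTimestamp(NamedTuple):
    """Hypothetical model of a DS creation timestamp."""
    seconds: int   # date/time of creation, as whole seconds
    replica: int   # replica number of the server that issued the create
    event: int     # per-second event counter on that server


def make_timestamp(when: datetime, replica: int, event: int) -> CreationTimestamp:
    return CreationTimestamp(int(when.timestamp()), replica, event)


when = datetime(2002, 8, 14, 9, 15, 42, tzinfo=timezone.utc)

# Replica 1, 19th event in that second (the example from the text).
first = make_timestamp(when, replica=1, event=19)
# Same server, same second, next event: still a distinct timestamp.
second = make_timestamp(when, replica=1, event=20)
# Different server, same second, same event count: also distinct.
other_server = make_timestamp(when, replica=2, event=19)

all_unique = len({first, second, other_server}) == 3
```

As long as each server increments its own event counter, two timestamps can only be equal if all three fields match, which is exactly the situation the bugs mentioned above produce.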
These two features (object id, timestamp) work together so that a process like renaming an object is no big deal. The object name is actually unimportant to maintaining the database's internal consistency. A delete, move, or rename event can be passed around to all of the relevant servers by specifying the timestamp of that object, and any other information needed for the event.
For example, a rename event passes the old name, the new name, and the timestamp, as a rename obituary. The servers involved can look at the old name, verify that it's correct via the timestamp, change the name to the new name, and verify that everything worked correctly so that the obit can be processed and cleared.
Since the server internally only cares about the timestamp, and the object id (file system rights in the Traditional File System), the object name and location in DS can be changed without the server caring very much.
A little history
Due to some problems over the years in Timesync.nlm, along with various bugs in DS and the operating system, there were some perceptions that timestamps were bad somehow. Microsoft made much ado about how "weak" this scheme was and how much better they would do with the release of MAD (Microsoft Active Directory) someday with a more robust "globally unique identifier". (As an aside, if you look under the covers of MAD, their "globally unique identifier" is essentially a timestamp and domain identifier combined into a bit of data and used as a GUID.) Novell took this criticism seriously, and with DS8 (SKADS) they introduced the GUID attribute to DS.
I don't think DS8 ever actually used the GUID for anything, but with eDir 8.5 and newer, they've been using it for more and more, and claims are made that the old Object ID and Timestamp are no longer used. (Again, as an aside, the GUID appears to be basically another timestamp type of thing, but I'm a programmer/engineer, not a marketing expert.)
Staying with the old for the moment, one of the things DSRepair used to check for is objects with duplicate timestamps. If it found any, it would throw up a complaint about it. The only resolution is to delete (all) objects with the duplicate timestamps, since the database cannot distinguish between them.
Moving forward, finally, and away from the history, the current versions of DSRepair no longer check for duplicate timestamps. There may be some in the database, but no errors or warnings are issued. It appears that the engineers have implemented the marketing claim that the timestamps are no longer used and everything is based on the GUID now.
Pinpointing the trouble
But, that's where trouble lies. It's not necessarily true. At least one process in DS (or eDir, if you prefer) still cares about the timestamps on the objects: the obituary process.
If TID-10062149 has been applied, vigorously and repeatedly, and the obits still will not clear (especially the "Type C" obits that don't seem to hurt anything but just hang around and won't go away), this is a problem you can check for. It is sufficiently obscure that there are probably only a few people on the planet who would even think to look for it. The only way I know of to do it is to use DSBrowse, since DSRepair no longer cares about the timestamps on the objects and will not report any problems with them, even if there is one.
Do your own check
To see if a timestamp problem is what is keeping DS from processing the obit:
- Use DSBrowse to look at the obit entry itself. Get the Creation Timestamp property from that object.
- Use the Object Search feature in DSBrowse to search for any object with this timestamp. Due to daylight saving time vs. UTC time, it sometimes takes a couple of tries to get the search right, but you will know you have it right when your search finds the Obit object. You will know you have the problem I'm describing here if you find the Obit AND another object.
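The logic of the check above amounts to grouping entries by Creation Timestamp and looking for any group with more than one member. The sketch below shows that idea on a hand-made list; it does not talk to DSBrowse or any DS API, and the entry names and the (date/time, replica, event) tuple format are made up for illustration.

```python
from collections import defaultdict


def find_timestamp_collisions(entries):
    """Return {timestamp: [names]} for every timestamp shared by 2+ entries."""
    by_timestamp = defaultdict(list)
    for name, timestamp in entries:
        by_timestamp[timestamp].append(name)
    return {ts: names for ts, names in by_timestamp.items() if len(names) > 1}


# Hypothetical data, as if copied by hand from DSBrowse: (name, timestamp).
entries = [
    ("stuck-obit-entry", ("14 Aug 2002 09:15:42", 1, 19)),
    ("CN=SomeUser",      ("14 Aug 2002 09:15:42", 1, 19)),  # same timestamp
    ("CN=OtherUser",     ("14 Aug 2002 09:15:42", 1, 20)),
]

collisions = find_timestamp_collisions(entries)
for timestamp, names in collisions.items():
    print(timestamp, "is shared by", names)
```

If the obit entry shows up in a group with any other object, that is the conflict described in this article.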
The road to resolution
The resolution has not changed. To clear the obit and fix the problem, all of the objects affected (found by the timestamp search in the second step above) have to be deleted. The delete obits have to clear completely (check all servers). Then the objects can be re-created.
If you have questions for David, he can be e-mailed at: dgersicTAKETHISOUT@niu.edu