What's Causing Slow Performance?
Novell Cool Solutions: Feature
By Subbu K.K.
Posted: 30 Sep 2003
Slowdowns can be really annoying, especially when the root cause is not obvious. eDirectory has many background processes that kick in during lean times (i.e. when there are no searches or updates). These processes scan through the cache, processing entries in the change cache or index entries. With a large cache holding hundreds of thousands of entries, the CPU can stay pegged for a long time. FLAIM also has internal processing threads that tend to soak up CPU time for sorting, timestamping and indexing. eDirectory is designed to be a distributed server: any change has to ripple out completely to the peer replica servers before it is flushed out of the server where the update was made.
I usually look for a variety of clues using iMonitor, vmstat and top. Here is what I check, in order of priority:
1. Check the Agent Activity page. Look at the DIB operations and background processing tasks and see which ones are active. Check the DIB statistics. You may have to refresh the screen every minute to rule out spikes. Is the agent doing searches, DIB writes, skulking or background processing? Background threads may become CPU bound if the cache is large. Once the processing is over, the CPU will come back to normal.
2. Check the Change Cache (CC) for every partition. The CC is the log of entries that have been modified but not yet fully synced in the ring. If it holds hundreds of thousands of entries, throttle further updates until the log drains. Look for a slow server agent in the ring that is holding up sync. The longer a changed entry lies around in the change cache, the longer background threads like the janitor and purger have to keep checking its timestamps. If you see high CPU utilization but very little disk I/O (except for a checkpoint spike every 3 minutes), suspect the background threads. When this happens in a multi-member ring, all the other replica agents will be waiting on the slowpoke to finish its sync. Restarting the server agent does not help in this case.
3. Sometimes, dropping the cache RAM by 50%-75% for a few minutes (say 10-15 minutes) and then restoring it to its original value helps. My theory is that starving the cache causes older values that are no longer needed to be eased out of it, and the space added back is then used to cache the latest values. In steady state the software takes care of this automatically, but I have seen manual nudging help in extreme situations like burst updates. This is the 'bouncing cache' trick.
4. The memory allocator used in 8.7.0 is not efficient under some data combinations. 8.7.1 (Falcon SP1) uses a more tolerant memory allocator. There is no easy way to check for this on a production server. Contact NTS/WWS if you don't see much background activity or a large change cache and the indexes are all online, yet a batch of LDAP queries still triggers sluggish behavior. Meanwhile, restarting the server may improve the situation.
5. An obscure defect in 8.6.2 is known to cause high CPU utilization. It was fixed in 8.7, but there could be other conditions under which the defect reemerges. Work with NTS/WWS to resolve it.
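The clue in step 2 (high CPU with very little disk I/O apart from the periodic checkpoint spike) can be checked mechanically from captured vmstat output. The following is only a rough sketch, not a Novell tool. It assumes the Linux procps vmstat column layout (us, sy, bi, bo), and the function names and both thresholds (80% CPU, 100 blocks/s) are made up for illustration and should be tuned to your server. The median is used for I/O so that a single checkpoint spike does not hide the pattern.

```python
import statistics
from typing import Dict, List


def parse_vmstat(text: str, skip_first: bool = True) -> List[Dict[str, int]]:
    """Turn `vmstat <interval>` output into one dict per sample row.

    Assumes the Linux procps layout, where the second header line names
    the columns (r, b, ..., bi, bo, ..., us, sy, id, ...).
    """
    columns: List[str] = []
    rows: List[Dict[str, int]] = []
    for line in text.splitlines():
        fields = line.split()
        if "us" in fields and "id" in fields:
            columns = fields  # column-name header (may repeat in long captures)
        elif columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(columns, map(int, fields))))
    # The first vmstat report is an average since boot, so it is dropped by default.
    return rows[1:] if skip_first else rows


def looks_cpu_bound(samples: List[Dict[str, int]],
                    cpu_busy_pct: float = 80.0,      # assumed threshold
                    io_blocks_per_s: float = 100.0   # assumed threshold
                    ) -> bool:
    """True when CPU is busy while typical disk I/O stays low."""
    if not samples:
        return False
    avg_cpu = statistics.mean(s["us"] + s["sy"] for s in samples)
    # Median, not mean: one checkpoint spike should not mask the pattern.
    typical_io = statistics.median(s["bi"] + s["bo"] for s in samples)
    return avg_cpu >= cpu_busy_pct and typical_io <= io_blocks_per_s
```

To try it, capture a few minutes of `vmstat 5` into a file and pass the file contents to parse_vmstat, then hand the result to looks_cpu_bound. A True result only points at the background threads as a suspect; the Agent Activity page in iMonitor is still needed to confirm which task is busy.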
When reporting such cases, it helps to include the DIB stats, change cache counts and agent activity. You may adapt the count-change-cache script to grab them unattended.
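Once change cache counts are being collected at intervals (however you obtain them), a few lines of code can tell you whether the log is draining and roughly how long it will take. This is a minimal sketch and not the count-change-cache script itself. The function names are hypothetical, the readings are assumed to be (seconds, entry count) pairs for a single partition, and the 100,000 limit is just one reading of "hundreds of thousands" from step 2.

```python
from typing import List, Optional, Tuple

Reading = Tuple[float, int]  # (time in seconds, change cache entries)


def drain_estimate(readings: List[Reading]) -> Tuple[float, Optional[float]]:
    """Return (drain rate in entries/sec, seconds until empty).

    A positive rate means the change cache is shrinking. The second value
    is None when the cache is flat or growing, since it will never empty
    at that rate.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings")
    (t0, n0), (t1, n1) = readings[0], readings[-1]
    if t1 <= t0:
        raise ValueError("readings must be in increasing time order")
    rate = (n0 - n1) / (t1 - t0)
    if rate <= 0:
        return rate, None
    return rate, n1 / rate


def should_throttle(readings: List[Reading], limit: int = 100_000) -> bool:
    """True when the latest count is at or above the assumed limit."""
    return readings[-1][1] >= limit
```

For example, readings of 300,000 entries at time 0 and 240,000 entries ten minutes later give a drain rate of 100 entries per second and about 40 minutes to empty. A flat or negative rate over several readings is a hint to go looking for the slow agent in the ring.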
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com